
How Much Computational Power Does It Take to Match the Human Brain?

  • Focus Area: Potential Risks from Advanced AI
  • Content Type: Research Reports

Table of contents

1 Introduction

1.1 Executive summary
1.2 Caveats
1.3 Context
1.4 FLOP/s basics
1.5 Neuroscience basics
1.5.1 Uncertainty in neuroscience
1.6 Clarifying the question
1.7 Existing literature
1.7.1 Mechanistic method estimates
1.7.2 Functional method estimates
1.7.3 Limit method estimates
1.7.4 Communication method estimates

2 The mechanistic method

2.1 Standard neuron signaling
2.1.1 Synaptic transmission
2.1.1.1 Spikes through synapses per second
2.1.1.2 FLOPs per spike through synapse
2.1.1.2.1 A simple model
2.1.1.2.2 Possible complications
2.1.2 Firing decisions
2.1.2.1 Predicting neuron behavior
2.1.2.1.1 Standards of accuracy
2.1.2.1.2 Existing results
2.1.2.2 Dendritic computation
2.1.2.3 Crabs, locusts, and other considerations
2.1.2.4 Expert opinion and practice
2.1.2.5 Overall FLOP/s for firing decisions
2.2 Learning
2.2.1 Timescales
2.2.2 Existing models
2.2.3 Energy costs
2.2.4 Expert opinion
2.2.5 Overall FLOP/s for learning
2.3 Other signaling mechanisms
2.3.1 Other chemical signals
2.3.2 Glia
2.3.3 Electrical synapses
2.3.4 Ephaptic effects
2.3.5 Other forms of axon signaling
2.3.6 Blood flow
2.3.7 Overall FLOP/s for other signaling mechanisms
2.4 Overall mechanistic method FLOP/s
2.4.1 Too low?
2.4.2 Too high?
2.4.2.1 Neuron populations and manifolds
2.4.2.2 Transistors and emulation costs
2.4.2.3 Do we need the whole brain?
2.4.2.4 Constraints faced by evolution
2.4.3 Beyond the mechanistic method

3 The functional method

3.1 The retina
3.1.1 Retina FLOP/s
3.1.2 From retina to brain
3.2 Visual cortex
3.2.1 What’s happening in the visual cortex?
3.2.2 What’s human level?
3.2.3 Making up some numbers
3.3 Other functional method estimates

4 The limit method

4.1 Bit-erasures in the brain
4.1.1 Landauer’s principle
4.1.2 Overall bit-erasures
4.2 From bit-erasures to FLOP/s
4.2.1 Algorithmic arguments
4.2.2 Hardware arguments
4.3 Overall weight for the limit method

5 The communication method

5.1 Communication in the brain
5.2 From communication to FLOP/s

6 Conclusion

6.1 Possible further investigations

7 Appendix: Concepts of brain FLOP/s

7.1 No constraints
7.2 Brain-like-ness
7.3 Findability
7.4 Other computer analogies
7.5 Summing up

8 Sources

Published: September 11, 2020 | by Joseph Carlsmith

Open Philanthropy is interested in when AI systems will be able to perform various tasks that humans can perform (“AI timelines”). To inform our thinking, I investigated what evidence the human brain provides about the computational power sufficient to match its capabilities. This is the full report on what I learned. A medium-depth summary is available here. The executive summary below gives a shorter overview.


1 Introduction

1.1 Executive summary

Let’s grant that in principle, sufficiently powerful computers can perform any cognitive task that the human brain can. How powerful is sufficiently powerful? I investigated what we can learn from the brain about this. I consulted with more than 30 experts, and considered four methods of generating estimates, focusing on floating point operations per second (FLOP/s) as a metric of computational power.

These methods were:

  1. Estimate the FLOP/s required to model the brain’s mechanisms at a level of detail adequate to replicate task-performance (the “mechanistic method”).1
  2. Identify a portion of the brain whose function we can already approximate with artificial systems, and then scale up to a FLOP/s estimate for the whole brain (the “functional method”).
  3. Use the brain’s energy budget, together with physical limits set by Landauer’s principle, to upper-bound required FLOP/s (the “limit method”).
  4. Use the communication bandwidth in the brain as evidence about its computational capacity (the “communication method”). I discuss this method only briefly.

None of these methods are direct guides to the minimum possible FLOP/s budget, as the most efficient ways of performing tasks need not resemble the brain’s ways, or those of current artificial systems. But if sound, these methods would provide evidence that certain budgets are, at least, big enough (if you had the right software, which may be very hard to create – see discussion in section 1.3).2

Here are some of the numbers these methods produce, plotted alongside the FLOP/s capacity of some current computers.

Figure 1: The report’s main estimates. See the conclusion for a list that describes them in more detail, and summarizes my evaluation of each.

These numbers should be held lightly. They are back-of-the-envelope calculations, offered alongside initial discussion of complications and objections. The science here is very far from settled.

For those open to speculation, though, here’s a summary of what I’m taking away from the investigation:

  • Mechanistic method estimates suggesting that 10^13-10^17 FLOP/s is enough to match the human brain’s task-performance seem plausible to me. This is partly because various experts are sympathetic to these estimates (others are more skeptical), and partly because of the direct arguments in their support. Some considerations from this method point to higher numbers; and some, to lower numbers. Of these, the latter seem to me stronger.3
  • I give less weight to functional method estimates, primarily due to uncertainties about (a) the FLOP/s required to fully replicate the functions in question, (b) what the relevant portion of the brain is doing (in the case of the visual cortex), and (c) differences between that portion and the rest of the brain (in the case of the retina). However, I take estimates based on the visual cortex as some weak evidence that the mechanistic method range above (10^13-10^17 FLOP/s) isn’t much too low. Some estimates based on recent deep neural network models of retinal neurons point to higher numbers, but I take these as even weaker evidence.
  • I think it unlikely that the required number of FLOP/s exceeds the bounds suggested by the limit method. However, I don’t think the method itself airtight. Rather, I find some arguments in the vicinity persuasive (though not all of them rely directly on Landauer’s principle); various experts I spoke to (though not all) were quite confident in these arguments; and other methods seem to point to lower numbers.
  • Communication method estimates may well prove informative, but I haven’t vetted them. I discuss this method mostly in the hopes of prompting further work.

Overall, I think it more likely than not that 10^15 FLOP/s is enough to perform tasks as well as the human brain (given the right software, which may be very hard to create). And I think it unlikely (<10%) that more than 10^21 FLOP/s is required.4 But I’m not a neuroscientist, and there’s no consensus in neuroscience (or elsewhere).

I offer a few more specific probabilities, keyed to one specific type of brain model, in the appendix.5 My current best-guess median for the FLOP/s required to run that particular type of model is around 10^15 (note that this is not an estimate of the FLOP/s uniquely “equivalent” to the brain – see discussion in section 1.6).

As can be seen from the figure above, the FLOP/s capacities of current computers (e.g., a V100 at ~10^14 FLOP/s for ~$10,000, the Fugaku supercomputer at ~4×10^17 FLOP/s for ~$1 billion) cover the estimates I find most plausible.6 However:

  • Computers capable of matching the human brain’s task performance would also need to meet further constraints (for example, constraints related to memory and memory bandwidth).
  • Matching the human brain’s task-performance requires actually creating sufficiently capable and computationally efficient AI systems, and I do not discuss how hard this might be (though note that training an AI system to do X, in machine learning, is much more resource-intensive than using it to do X once trained).7

So even if my best-guesses are right, this does not imply that we’ll see AI systems as capable as the human brain anytime soon.

Acknowledgements: This report emerged out of Open Philanthropy’s engagement with some arguments suggested by one of our technical advisors, Dario Amodei, in the vein of the mechanistic/functional methods (see citations throughout the report for details). However, my discussion should not be treated as representative of Dr. Amodei’s views; the project eventually broadened considerably; and my conclusions are my own. My thanks to Dr. Amodei for prompting the investigation, and to Open Philanthropy’s technical advisors Paul Christiano and Adam Marblestone for help and discussion with respect to different aspects of the report. I am also grateful to the following external experts for talking with me. In neuroscience: Stephen Baccus, Rosa Cao, E.J. Chichilnisky, Erik De Schutter, Shaul Druckmann, Chris Eliasmith, davidad (David A. Dalrymple), Nick Hardy, Eric Jonas, Ilenna Jones, Ingmar Kanitscheider, Konrad Kording, Stephen Larson, Grace Lindsay, Eve Marder, Markus Meister, Won Mok Shim, Lars Muckli, Athanasia Papoutsi, Barak Pearlmutter, Blake Richards, Anders Sandberg, Dong Song, Kate Storrs, and Anthony Zador. In other fields: Eric Drexler, Owain Evans, Michael Frank, Robin Hanson, Jared Kaplan, Jess Riedel, David Wallace, and David Wolpert. My thanks to Dan Cantu, Nick Hardy, Stephen Larson, Grace Lindsay, Adam Marblestone, Jess Riedel, and David Wallace for commenting on early drafts (or parts of early drafts) of the report; to six other neuroscientists (unnamed) for reading/commenting on a later draft; to Ben Garfinkel, Catherine Olsson, Chris Sommerville, and Heather Youngs for discussion; to Nick Beckstead, Ajeya Cotra, Allan Dafoe, Tom Davidson, Owain Evans, Katja Grace, Holden Karnofsky, Michael Levine, Luke Muehlhauser, Zachary Robinson, David Roodman, Carl Shulman, Bastian Stern, and Jacob Trefethen for valuable comments and suggestions; to Charlie Giattino, for conducting some research on the scale of the human brain; to Asya Bergal for sharing with me some of her research on Landauer’s principle; to Jess Riedel for detailed help with the limit method section; to AI Impacts for sharing some unpublished research on brain-computer equivalence; to Rinad Alanakrih for help with image permissions; to Robert Geirhos, IEEE, and Sage Publications for granting image permissions; to Jacob Hilton and Gregory Toepperwein for help estimating the FLOP/s costs of different models; to Hannah Aldern and Anya Grenier for help with recruitment; to Eli Nathan for extensive help with the website and citations; to Nik Mitchell, Andrew Player, Taylor Smith, and Josh You for help with conversation notes; and to Nick Beckstead for guidance and support throughout the investigation.

1.2 Caveats

(This section discusses some caveats about the report’s epistemic status, and some notes on presentation. Those eager for the main content, however uncertain, can skip to section 1.3.)

Some caveats:

  • Little if any of the evidence surveyed in this report is particularly conclusive. My aim is not to settle the question, but to inform analysis and decision-making that must proceed in the absence of conclusive evidence, and to lay groundwork for future work.
  • I am not an expert in neuroscience, computer science, or physics (my academic background is in philosophy).
  • I sought out a variety of expert perspectives, but I did not make a rigorous attempt to ensure that the experts I spoke to were a representative sample of opinion in the field. Various selection effects influencing who I interviewed plausibly correlate with sympathy towards lower FLOP/s requirements.8
  • For various reasons, the research approach used here differs from what might be expected in other contexts. Key differences include:
    • I give weight to intuitions and speculations offered by experts, as well as to factual claims by experts that I have not independently verified (these are generally documented in conversation notes approved by the experts themselves).
    • I report provisional impressions from initial research.
    • My literature reviews on relevant sub-topics are not comprehensive.
    • I discuss unpublished papers where they appear credible.
    • My conclusions emerge from my own subjective synthesis of the evidence I engaged with.
  • There are ongoing questions about the baseline reliability of various kinds of published research in neuroscience and cognitive science.9 I don’t engage with this issue explicitly, but it is an additional source of uncertainty.

A few other notes on presentation:

  • I have tried to keep the report accessible to readers with a variety of backgrounds.
  • The endnotes are frequent and sometimes lengthy, and they contain more quotes and descriptions of my research process than is academically standard. This is out of an effort to make the report’s reasoning transparent to readers. However, the endnotes are not essential to the main content, and I suggest only reading them if you’re interested in more details about a particular point.
  • I draw heavily on non-verbatim notes from my conversations with experts, made public with their approval and cited/linked in endnotes. These notes are also available here.
  • I occasionally use the word “compute” as a shorthand for “computational power.”
  • Throughout the rest of the report, I use a form of scientific notation, in which “XeY” means “X×10Y.” Thus, 1e6 means 1,000,000 (a million); 4e8 means 400,000,000 (four hundred million); and so on. I also round aggressively.

1.3 Context

(This section briefly describes what prompts Open Philanthropy’s interest in the topic of this report. Those primarily interested in the main content can skip to Section 1.4.)

This report is part of a broader effort at Open Philanthropy to investigate when advanced AI systems might be developed (“AI timelines”) – a question that we think decision-relevant for our grant-making related to potential risks from advanced AI.10 But why would an interest in AI timelines prompt an interest in the topic of this report in particular?

Some classic analyses of AI timelines (notably, by Hans Moravec and Ray Kurzweil) emphasize forecasts about when available computer hardware will be “equivalent,” in some sense (see section 1.6 for discussion), to the human brain.11

Figure 2: Graph schema for classic forecasts. See real examples here and here.

A basic objection to predicting AI timelines on this basis alone is that you need more than hardware to do what the brain does.12 In particular, you need software to run on your hardware, and creating the right software might be very hard (Moravec and Kurzweil both recognize this, and appeal to further arguments).13


In the context of machine learning, we can offer a more specific version of this objection: the hardware required to run an AI system isn’t enough; you also need the hardware required to train it (along with other resources, like data).14 And training a system requires running it a lot. DeepMind’s AlphaGo Zero, for example, trained on ~5 million games of Go.15

Note, though, that depending on what sorts of task-performance will result from what sorts of training, a framework for thinking about AI timelines that incorporated training requirements would start, at least, to incorporate and quantify the difficulty of creating the right software more broadly.16 This is because training turns computation, data, and other resources into software you wouldn’t otherwise know how to make.

What’s more, the hardware required to train a system is related to the hardware required to run it.17 This relationship is central to Open Philanthropy’s interest in the topic of this report, and to an investigation my colleague Ajeya Cotra has been conducting, which draws on my analysis. That investigation focuses on what brain-related FLOP/s estimates, along with other estimates and assumptions, might tell us about when it will be feasible to train different types of AI systems. I don’t discuss this question here, but it’s an important part of the context. And in that context, brain-related hardware estimates play a different role than they do in forecasts like Moravec’s and Kurzweil’s.

1.4 FLOP/s basics

(This section discusses what FLOP/s are, and why I chose to focus on them. Readers familiar with FLOP/s and happy with this choice can skip to Section 1.5.)

Computational power is multidimensional – encompassing, for example, the number and type of operations performed per second, the amount of memory stored at different levels of accessibility, and the speed with which information can be accessed and sent to different locations.18

This report focuses on operations per second, and in particular, on “floating point operations.”19 These are arithmetic operations (addition, subtraction, multiplication, division) performed on a pair of floating point numbers – that is, numbers represented as a set of significant digits multiplied by some other number raised to some exponent (like scientific notation). I’ll use “FLOPs” to indicate floating point operations (plural), and “FLOP/s” to indicate floating point operations per second.
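To make the unit concrete, here is a toy illustration (my own, for exposition only): a dot product of two length-n vectors takes n multiplications and n−1 additions, or roughly 2n FLOPs, and a chip’s FLOP/s rating bounds how fast it can perform such computations.

```python
# Toy illustration (for exposition only): counting floating point
# operations in a dot product of two length-n vectors.

def dot_product_flops(n: int) -> int:
    """n multiplications plus (n - 1) additions: roughly 2n FLOPs."""
    return n + (n - 1)

# At ~1e14 FLOP/s (roughly a V100, per the executive summary), a
# length-1e6 dot product costs about 2e6 / 1e14 = 2e-8 seconds --
# ignoring memory and other non-FLOP constraints, discussed below.
n = 10**6
print(dot_product_flops(n))           # 1999999
print(dot_product_flops(n) / 1e14)    # ~2e-08 seconds at 1e14 FLOP/s
```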

My central reason for focusing on FLOP/s is that various brain-related FLOP/s estimates are key inputs to the framework for thinking about training requirements, mentioned above, that my colleague Ajeya Cotra has been investigating, and they were the focus of Open Philanthropy’s initial exploration of this topic, out of which this report emerged. Focusing on FLOP/s in particular also limits the scope of what is already a fairly broad investigation; and the availability of FLOP/s is one key contributor to recent progress in AI.20

Still, the focus on FLOP/s is a key limitation of this analysis, as other computational resources are just as crucial to task-performance: if you can’t store the information you need, or get it where it needs to be fast enough, then the units in your system that perform FLOPs will be some combination of useless and inefficiently idle.21 Indeed, my understanding is that FLOP/s are often not the relevant bottleneck in various contexts related to AI and brain modeling.22 And further dimensions of an AI system’s implementation, like hardware architecture, can introduce significant overheads, both in FLOP/s and other resources.23

Ultimately, though, once other computational resources are in place, and other overheads have mostly been eliminated or accounted for, you need to actually perform the FLOP/s that a given time-limited computation requires. In order to isolate this quantity, I proceed on the idealizing assumption that non-FLOP resources are available in amounts adequate to make full use of all of the FLOP/s in question (but not in unrealistically extreme abundance), without significant overheads.24 All talk of the “FLOP/s sufficient to X” assumes this caveat.

This means you can’t draw conclusions about which concrete computers can replicate human-level task performance directly from the FLOP/s estimates in this report, even if you think those estimates credible. Such computers will need to meet further constraints.25

Note, as well, that these estimates do not depend on the assumption that the brain performs operations analogous to FLOPs, or on any other similarities between brain architectures and computer architectures.26 The report assumes that the tasks the brain performs can also be performed using a sufficient number of FLOP/s, but the causal structure in the brain that gives rise to task-performance could in principle take a wide variety of unfamiliar forms.

1.5 Neuroscience basics

(This section reviews some of the neural mechanisms I’ll be discussing, in an effort to make the report’s content accessible to readers without a background in neuroscience.27 Those familiar with signaling mechanisms in the brain – neurons, neuromodulators, gap junctions – can skip to Section 1.5.1).

The human brain contains around 100 billion neurons, and roughly the same number of non-neuronal cells.28 Neurons are cells specialized for sending and receiving various types of electrical and chemical signals, and other non-neuronal cells send and receive signals as well.29 These signals allow the brain, together with the rest of the nervous system, to receive and encode sensory information from the environment, to process and store this information, and to output the complex, structured motor behavior constitutive of task performance.30

Figure 3: Diagram of a neuron. From OpenStax, “Anatomy and Physiology”, Section 12.2, unaltered. Licensed under CC BY 4.0.


We can divide a typical neuron into three main parts: the soma, the dendrites, and the axon.31 The soma is the main body of the cell. The dendrites are extensions of the cell that branch off from the soma, and which typically receive signals from other neurons. The axon is a long, tail-like projection from the soma, which carries electrical impulses away from the cell body. The end of the axon splits into branches, the ends of which are known as axon terminals, which reach out to connect with other cells at locations called synapses. A typical synapse forms between the axon terminal of one neuron (the presynaptic neuron) and the dendrite of another (the postsynaptic neuron), with a thin zone of separation between them known as the synaptic cleft.32

The cell as a whole is enclosed in a membrane that has various pumps that regulate the concentration of certain ions – such as sodium (Na+), potassium (K+) and chloride (Cl–) – inside it.33 This regulation creates different concentrations of these ions inside and outside the cell, resulting in a difference in the electrical potential across the membrane (the membrane potential).34 The membrane also contains proteins known as ion channels, which, when open, allow certain types of ions to flow into and out of the cell.35

If the membrane potential in a neuron reaches a certain threshold, then a particular set of voltage-gated ion channels open to allow ions to flow into the cell, creating a temporary spike in the membrane potential (an action potential).36 This spike travels down the axon to the axon terminals, where it causes further voltage-gated ion channels to open, allowing an influx of calcium ions into the pre-synaptic axon terminal. This calcium can trigger the release of molecules known as neurotransmitters, which are stored in sacs called vesicles in the axon terminal.37

These vesicles merge with the cell membrane at the synapse, allowing the neurotransmitter they contain to diffuse across the synaptic cleft and bind to receptors on the post-synaptic neuron. These receptors can cause (directly or indirectly, depending on the type of receptor) ion channels on the post-synaptic neuron to open, thereby altering the membrane potential in that area of that cell.38

Figure 4: Diagram of synaptic communication. From OpenStax, “Anatomy and Physiology”, Section 12.5, unaltered. Licensed under CC BY 4.0.39


The expected size of the impact (excitatory or inhibitory) that a spike through a synapse will have on the post-synaptic membrane potential is often summarized via a parameter known as a synaptic weight.40 This weight changes on various timescales, depending on the history of activity in the pre-synaptic and post-synaptic neuron, together with other factors. These changes, along with others that take place within synapses, are grouped under the term synaptic plasticity.41 Other changes also occur in neurons on various timescales, affecting the manner in which neurons respond to synaptic inputs (some of these changes are grouped under the term intrinsic plasticity).42 New synapses, dendritic spines, and neurons also grow over time, and old ones die.43

There are also a variety of other signaling mechanisms in the brain that this basic story does not include. For example:

  • Other chemical signals: Neurons can also send and receive other types of chemical signals – for example, molecules known as neuropeptides, and gases like nitric oxide – that can diffuse more broadly through the space in between cells, across cell membranes, or via the blood.44 The chemicals neurons release that influence the activity of groups of neurons (or other cells) are known as neuromodulators.45
  • Glial cells: Non-neuronal cells in the brain known as glia have traditionally been thought to mostly perform functions to do with maintenance of brain function, but they may be involved in task-performance as well.46
  • Electrical synapses: In addition to the chemical synapses discussed above, there are also electrical synapses that allow direct, fast, and bi-directional exchange of electrical signals between neurons (and between other cells). The channels mediating this type of connection are known as gap junctions.
  • Ephaptic effects: Electrical activity in neurons creates electric fields that may impact the electrical properties of neighboring neurons.47
  • Other forms of axon signaling: The process of firing an action potential has traditionally been thought of as a binary decision.48 However, some recent evidence indicates that processes within a neuron other than “to fire or not to fire” can matter for synaptic communication.49
  • Blood flow: Blood flow in the brain correlates with neural activity, which has led some to suggest that it might be playing a role in information-processing.50

This is not a complete list of all the possible signaling mechanisms that could in principle be operative in the brain.51 But these are some of the most prominent.

1.5.1 Uncertainty in neuroscience

I want to emphasize one other meta-point about neuroscience: namely, that our current understanding of how the brain processes information is extremely limited.52 This was a consistent theme in my conversations with experts, and one of my clearest take-aways from the investigation as a whole.53

One problem is that we need better tools. For example:

  • Despite advances, we can only record the spiking activity of a limited number of neurons at the same time (techniques like fMRI and EEG are much lower resolution).54
  • We can’t record from all of a neuron’s synapses or dendrites simultaneously, making it hard to know what patterns of overall synaptic input and dendritic activity actually occur in vivo.55
  • We also can’t stimulate all of a neuron’s synapses and/or dendrites simultaneously, making it hard to know how the cell responds to different inputs (and hence, which models can capture these responses).56
  • Techniques for measuring many lower-level biophysical mechanisms and processes, such as possible forms of ion channel plasticity, remain very limited.57
  • Results in model animals may not generalize to e.g. humans.58
  • Results obtained in vitro (that is, in a petri dish) may not generalize in vivo (that is, in a live functioning brain).59
  • The tasks we can give model animals like rats to perform are generally very simple, and so provide limited evidence about more complex behavior.60

Tools also constrain concepts. If we can’t see or manipulate something, it’s unlikely to feature in our theories.61 And certain models of e.g. neurons may receive scant attention simply because they are too computation-intensive to work with, or too difficult to constrain with available data.62

But tools aren’t the only problem. For example, when Jonas and Kording (2017) examined a simulated 6502 microprocessor – a system whose processing they could observe and manipulate to arbitrary degrees – using analogues of standard neuroscientific approaches, they found that “the approaches reveal interesting structure in the data but do not meaningfully describe the hierarchy of information processing in the microprocessor” (p. 1).63 And artificial neural networks that perform complex tasks are difficult (though not necessarily impossible) to interpret, despite similarly ideal experimental access.64

We also don’t know what high-level task most neural circuits are performing, especially outside of peripheral sensory/motor systems. This makes it very hard to say what models of such circuits are adequate.65

It would help if we had full functional models of the nervous systems of some simple animals. But we don’t.66 For example, the nematode worm Caenorhabditis elegans (C. elegans) has only 302 neurons, and a map of the connections between these neurons (the connectome) has been available since 1986.67 But we have yet to build a simulated C. elegans that behaves like the real worm across a wide range of contexts.68

All this counsels pessimism about the robustness of FLOP/s estimates based on our current neuroscientific understanding. And it increases the relevance of where we place the burden of proof. If we start with a strong default view about the complexity of the brain’s task-performance, and then demand proof to the contrary, our standards are unlikely to be met.

Indeed, my impression is that various “defaults” in this respect play a central role in how experts approach this topic. Some take simple models that have had some success as a default, and then ask whether we have strong reason to think additional complexity necessary;69 others take the brain’s biophysical complexity as a default, and then ask if we have strong reason to think that a given type of simplification captures everything that matters.70

Note the distinction, though, between how we should do neuroscience, and how we should bet now about where such science will ultimately lead, assuming we had to bet. The former question is most relevant to neuroscientists; but the latter is what matters here.

1.6 Clarifying the question

Consider the set of cognitive tasks that the human brain can perform, where task performance is understood as the implementation of a specified type of relationship between a set of inputs and a set of outputs.71 Examples of such tasks might include:

  • Reading an English-language description of a complex software problem, and, within an hour, outputting code that solves that problem.72
  • Reading a randomly selected paper submitted to the journal Nature, and, within a week, outputting a review of the paper of quality comparable to an average peer-reviewer.73
  • Reading newly-generated Putnam Math competition problems, and, within six hours, outputting answers that would receive a perfect score by standard judging criteria.74

Defining tasks precisely can be arduous. I’ll assume such precision is attainable, but I won’t try to attain it, since little in what follows depends on the details of the tasks in question. I’ll also drop the adjective “cognitive” in what follows.

I will also assume that sufficiently powerful computers can in principle perform these tasks (I focus solely on non-quantum computers – see endnote for discussion of quantum brain hypotheses).75 This assumption is widely shared both within the scientific community and beyond it. Some dispute it, but I won’t defend it here.76

The aim of the report is to evaluate the extent to which the brain provides evidence, for some number of FLOP/s F, that for any task T that the human brain can perform, T can be performed with F.77 As a proxy for FLOP/s numbers with this property, I will sometimes talk about the FLOP/s sufficient to run a “task-functional model,” by which I mean a computational model that replicates a generic human brain’s task-performance. Of course, some brains can do things others can’t, but I’ll assume that at the level of precision relevant to this report, human brains are roughly similar, and hence that if F FLOP/s is enough to replicate the task performance of a generic human brain, roughly F is enough to replicate any task T the human brain can perform.78

The project here is related to, but distinct from, directly estimating the minimum FLOP/s sufficient to perform any task the brain can perform. Here’s an analogy. Suppose you want to build a bridge across the local river, and you’re wondering if you have enough bricks. You know of only one such bridge (the “old bridge”), so it’s natural to look there for evidence. If the old bridge is made of bricks, you could count them. If it’s made of something else, like steel, you could try to figure out how many bricks you need to do what a given amount of steel does. If successful, you’ll end up confident that e.g. 100,000 bricks is enough to build such a bridge, and hence that the minimum is less than this. But how much less is still unclear. You studied an example bridge, but you didn’t derive theoretical limits on the efficiency of bridge-building.

That said, Dr. Paul Christiano expected there to be at least some tasks such that (a) the brain’s methods of performing them are close to maximally efficient, and (b) these methods use most of the brain’s resources (see endnote).79 I don’t investigate this claim here, but if true, it would make data about the brain more directly relevant to the minimum adequate FLOP/s budget.

The project here is also distinct from estimating the FLOP/s “equivalent” to the human brain. As I discuss in the report’s appendix, I think the notion of “the FLOP/s equivalent to the brain” requires clarification: there are a variety of importantly different concepts in the vicinity.

To get a flavor of this, consider the bridge analogy again, but assume that the old bridge is made of steel. What number of bricks would be “equivalent” to the old bridge? The question seems ill-posed. It’s not that bridges can’t be built from bricks. But we need to say more about what we want to know.

I group the salient possible concepts of the “FLOP/s equivalent to the human brain” into four categories:

  1. FLOP/s required for task-performance, with no further constraints on how the tasks need to be performed.80
  2. FLOP/s required for task-performance + brain-like-ness constraints – that is, constraints on the similarity between how the AI system does it, and how the brain does it.
  3. FLOP/s required for task-performance + findability constraints – that is, constraints on what sorts of training processes and engineering efforts would be able to create the AI system in question.
  4. Other analogies with human-engineered computers.

All these categories have their own problems (see section A.5 for a summary chart). The first is closest to the report’s focus, but as just noted, it’s hard (at least absent further assumptions) to estimate directly using example systems. The second faces the problem of identifying a non-arbitrary brain-like-ness constraint that picks out a unique number of FLOP/s, without becoming too much like the first. The third brings in a lot of additional questions about what sorts of systems are what sorts of findable. And the fourth, I suggest, either collapses into the first or second, or raises its own questions.

In the hopes of avoiding some of these problems, I have kept the report’s framework broad. The brain-based FLOP/s budgets I’m interested in don’t need to be uniquely “equivalent” to the brain, or as small as theoretically possible, or accommodating of any constraints on brain-like-ness or findability. They just need to be big enough, in principle, to perform the tasks in question.

A few other clarifications:

  • Properties construed as consisting in something other than the implementation of a certain type of input-output relationship (for example, properties like phenomenal consciousness, moral patienthood, or continuity with a particular biological human’s personal identity – to the extent they are so construed) are not included in the definition of the type of task-performance I have in mind. Systems that replicate this type of task-performance may or may not also possess such properties, but what matters here are inputs and outputs.81
  • Many tasks require more than a brain. For example, they may require something like a body, or rely partly on information-processing taking place outside the brain.82 In those cases, I’m interested in the FLOP/s sufficient to replicate the brain’s role.

1.7 Existing literature

(This section reviews existing literature.83 Those interested primarily in the report’s substantive content can skip to Section 2.)

A lot of existing research is relevant to estimating the FLOP/s sufficient to run a task-functional model. But efforts in the mainstream academic literature to address this question directly are comparatively rare (a fact that this report does not alter). Many existing estimates are informal, and they often do not attempt much justification of their methods or background assumptions. The specific question they consider also varies, and their credibility varies widely.84

1.7.1 Mechanistic method estimates

The most common approach assigns a unit of computation (such as a calculation, a number of bits, or a possibly brain-specific operation) to a spike through a synapse, and then estimates the rate of spikes through synapses by multiplying an estimate of the average firing rate by an estimate of the number of synapses.85 Thus, Merkle (1989),86 Mead (1990),87 Freitas (1996),88 Sarpeshkar (1997),89 Bostrom (1998),90 Kurzweil (1999),91 Dix (2005),92 Malickas (2007),93 and Tegmark (2017)94 are all variations on this theme.95 Their estimates range from ~1e12 to ~1e17 (though using different basic units of computation),96 but the variation results mainly from differences in estimated synapse count and average firing rate, rather than differences in substantive assumptions about how to make estimates of this kind.97 In this sense, the helpfulness of these estimates is strongly correlated: if the basic approach is wrong, none of them are a good guide.

Other estimates use a similar approach, but include more complexity. Sarpeshkar (2010) includes synaptic conductances (see discussion in section 2.1.1.2.2), learning, and firing decisions in a lower bound estimate (6e16 FLOP/s);98 Martins et al. (2012) estimate the information-processing rate of different types of neurons in different regions, for a total of ~5e16 bits/sec in the whole brain;99 and Kurzweil (2005) offers an upper bound estimate for a personality-level simulation of 1e19 calculations per second – an estimate that budgets 1e3 calculations per spike through synapse to capture nonlinear interactions in dendrites.100 Still others attempt estimates based on protein interactions (Thagard (2002), 1e21 calculations/second);101 microtubules (Tuszynski (2006), 1e21 FLOP/s);102 individual neurons (von Neumann (1958), 1e11 bits/second);103 and possible computations performed by dendrites and other neural mechanisms (Dettmers (2015), 1e21 FLOP/s).104

A related set of estimates comes from the literature on brain simulations. Ananthanarayanan et al. (2009) estimates >1e18 FLOP/s to run a real-time human brain simulation;105 Waldrop (2012) cites Henry Markram as estimating 1e18 FLOP/s to run a very detailed simulation;106 Markram, in a 2018 video (18:28), estimates that you’d need ~4e29 FLOP/s to run a “real-time molecular simulation of the human brain”;107 and Eugene Izhikevich estimates that a real-time brain simulation would require ~1e6 processors running at 384 GHz.108

Sandberg and Bostrom (2008) also estimate the FLOP/s requirements for brain emulations at different levels of detail. Their estimates range from 1e15 FLOP/s for an “analog network population model,” to 1e43 FLOP/s for emulating the “stochastic behavior of single molecules.”109 They report that in an informal poll of attendees at a workshop on whole brain emulation, the consensus appeared to be that the required level of resolution would fall between “Spiking neural network” (1e18 FLOP/s), and “Metabolome” (1e25 FLOP/s).110

Despite their differences, I group all of these estimates under the broad heading of the “mechanistic method,” as all of them attempt to identify task-relevant causal structure in the brain’s biological mechanisms, and quantify it in some kind of computational unit.

1.7.2 Functional method estimates

A different class of estimates focuses on the FLOP/s sufficient to replicate the function of some portion of the brain, and then attempts to scale up to an estimate for the brain as a whole (the “functional method”). Moravec (1988), for example, estimates the computation required to do what the retina does (1e9 calculations/second) and then scales up (1e14 calc/s).111 Merkle (1989) performs a similar retina-based calculation and gets 1e12-1e14 ops/sec.112

Kurzweil (2005) offers a functional method estimate (1e14 calcs/s) based on work by Lloyd Watts on sound localization,113 another (1e15 calcs/s) based on a cerebellar simulation at the University of Texas;114 and a third (1e14 calcs/s), in his 2012 book, based on the FLOP/s he estimates is required to emulate what he calls a “pattern recognizer” in the neocortex.115 Drexler (2019) uses the FLOP/s required for various deep learning systems (specifically: Google’s Inception architecture, Deep Speech 2, and Google’s neural machine translation model) to generate various estimates he takes to suggest that 1e15 FLOP/s is sufficient to match the brain’s functional capacity.116

1.7.3 Limit method estimates

Sandberg (2016) uses Landauer’s principle to generate an upper bound of ~2e22 irreversible operations per second in the brain – a methodology I consider in more detail in Section 4.117 De Castro (2013) estimates a similar limit, also from Landauer’s principle, on perceptual operations performed by the parts of the brain involved in rapid, automatic inference (1e23 operations per second).118 I have yet to encounter other attempts to bound the brain’s overall computation via Landauer’s principle,119 though many papers discuss related issues in the brain and in biological systems more broadly.120
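To give a flavor of the style of calculation (a rough sketch with assumed round inputs – a ~20 W power budget for the brain and T ≈ 310 K – rather than Sandberg’s exact figures): Landauer’s principle sets a minimum energy of kT·ln(2) per irreversible bit-erasure, so dividing the brain’s power budget by that minimum upper-bounds its erasure rate.

```python
import math

# Hedged sketch of a Landauer-style bound (assumed round inputs, not
# Sandberg's exact figures): power budget / minimum energy per erasure.
k_B = 1.380649e-23   # Boltzmann constant, J/K
T = 310.0            # assumed brain temperature, K
P = 20.0             # assumed brain power budget, W

min_energy_per_erasure = k_B * T * math.log(2)    # ~3e-21 J
max_erasures_per_second = P / min_energy_per_erasure
print(f"{max_erasures_per_second:.1e}")           # ~6.7e21 bit-erasures/s
```

This lands within an order of magnitude of Sandberg’s ~2e22; the gap reflects differing input assumptions.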

1.7.4 Communication method estimates

AI Impacts estimates the communication capacity of the brain (measured as “traversed edges per second” or TEPS), then combines this with an observed ratio of TEPS to FLOP/s in some human-engineered computers, to arrive at an estimate of brain FLOP/s (~1e16-3e17 FLOP/s).121 I discuss methods in this broad category – what I call the “communication method” – in Section 5.

Let’s turn now to evaluating the methods themselves. Rather than looking at all possible ways of applying them, my discussion will focus on what seem to me like the most plausible approaches I’m aware of, and the most important arguments/objections.


2 The mechanistic method

The first method I’ll be discussing – the “mechanistic method” – attempts to estimate the computation required to model the brain’s biological mechanisms at a level of detail adequate to replicate task performance.

Simulating the brain in extreme detail would require enormous amounts of computational power.122 Which details would need to be included in a computational model, and which, if any, could be left out or summarized?

The approach I’ll pursue focuses on signaling between cells. Here, the idea is that for a process occurring in a cell to matter to task-performance, it needs to affect the type of signals (e.g. neurotransmitters, neuromodulators, electrical signals at gap junctions, etc.) that cell sends to other cells.123 Hence, a model of that cell that replicates its signaling behavior (that is, the process of receiving signals, “deciding” what signals to send out, and sending them) would replicate the cell’s role in task-performance, even if it leaves out or summarizes many other processes occurring in the cell. Do that for all the cells in the brain involved in task-performance, and you’ve got a task-functional model.

I’ll divide the signaling processes that might need to be modeled into three categories:

  1. Standard neuron signaling.124 I’ll divide this into two parts:
    • Synaptic transmission. The signaling process that occurs at a chemical synapse as a result of a spike.
    • Firing decisions. The processes that cause a neuron to spike or not spike, depending on input from chemical synapses and other variables.
  2. Learning. Processes involved in learning and memory formation (e.g., synaptic plasticity, intrinsic plasticity, and growth/death of cells and synapses), where not covered by (1).
  3. Other signaling mechanisms. Any other signaling mechanisms (neuromodulation, electrical synapses, ephaptic effects, glial signaling, etc.) not covered by (1) or (2).

As a first-pass framework, we can think of synaptic transmission as a function from spiking inputs at synapses to some sort of output impact on the post-synaptic neuron; and of firing decisions as (possibly quite complex) functions that take these impacts as inputs, and then produce spiking outputs – outputs which themselves serve as inputs to downstream synaptic transmission. Learning changes these functions over time (though it can involve other changes as well, like growing new neurons and synapses). Other signaling mechanisms do other things, and/or complicate this basic picture.

Figure 5: Basic framework I use for the mechanistic method.
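As a toy sketch of this first-pass framework (purely illustrative Python: a caricature with made-up parameters, not a biophysical model or the report’s method):

```python
import numpy as np

# Toy sketch of the first-pass framework above (illustrative only):
# synaptic transmission -> firing decisions -> learning.
rng = np.random.default_rng(0)
n_neurons = 100
weights = rng.normal(0.0, 0.1, size=(n_neurons, n_neurons))  # synaptic weights
potential = np.zeros(n_neurons)                              # membrane potentials
threshold = 1.0

def step(spiked: np.ndarray) -> np.ndarray:
    """One time-step: integrate synaptic inputs, then make firing decisions."""
    global potential
    # Synaptic transmission: each incoming spike adds its synaptic weight.
    potential += weights @ spiked
    # Firing decision: spike if the membrane potential crosses threshold.
    out = (potential >= threshold).astype(float)
    potential[out == 1.0] = 0.0    # reset neurons that fired
    return out

def learn(pre: np.ndarray, post: np.ndarray, lr: float = 0.01) -> None:
    """Learning: a crude Hebbian update standing in for plasticity."""
    global weights
    weights += lr * np.outer(post, pre)

spikes = (rng.random(n_neurons) < 0.05).astype(float)   # initial activity
new_spikes = step(spikes)
learn(spikes, new_spikes)
```

Other signaling mechanisms would show up here as additional state and update rules outside this loop.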

This isn’t an ideal carving, but hopefully it’s helpful regardless.125 Here’s the mechanistic method formula that results:

Total FLOP/s = FLOP/s for standard neuron signaling +
FLOP/s for learning +
FLOP/s for other signaling mechanisms

I’m particularly interested in the following argument:

  1. You can capture standard neuron signaling and learning with somewhere between ~1e13-1e17 FLOP/s overall.
  2. This is the bulk of the FLOP/s burden (other processes may be important to task-performance, but they won’t require comparable FLOP/s to capture).

I’ll discuss why one might find (1) and (2) plausible in what follows. I don’t think it at all clear that these claims are true, but they seem plausible to me, partly on the merits of various arguments I’ll discuss, and partly because some of the experts I engaged with were sympathetic (others were less so). I also discuss some ways this range could be too high, and too low.

2.1 Standard neuron signaling

Here is the sub-formula for standard neuron signaling:

FLOP/s for standard neuron signaling = FLOP/s for synaptic transmission + FLOP/s for firing decisions

I’ll budget for each in turn.


2.1.1 Synaptic transmission

Let’s start with synaptic transmission. This occurs as a result of spikes through synapses, so I’ll base this budget on spikes through synapses per second × FLOPs per spike through synapse (I discuss some assumptions this involves below).

2.1.1.1 Spikes through synapses per second

How many spikes through synapses happen per second?

As noted above, the human brain has roughly 100 billion neurons.126 Synapse count appears to be more uncertain,127 but most estimates I’ve seen fall in the range of an average of 1,000-10,000 synapses per neuron, and between 1e14 and 1e15 overall.128

How many spikes arrive at a given synapse per second, on average?

  • Maximum neuron firing rates can exceed 100 Hz,129 but in vivo recordings suggest that neurons usually fire at lower rates – between 0.01 and 10 Hz.130
  • Experts I engaged with tended to use average firing rates of 1-10 Hz.131
  • Energy costs limit spiking. Lennie (2003), for example, uses energy costs to estimate a 0.16 Hz average in the cortex, and 0.94 Hz “using parameters that all tend to underestimate the cost of spikes.”132 He also estimates that “to sustain an average rate of 1.8 spikes/s/neuron would use more energy than is normally consumed by the whole brain” (13 Hz would require more than the whole body).133
  • Existing recording methods may bias towards active cells.134 Shoham et al. (2005), for example, suggests that recordings may overlook large numbers of “silent” neurons that fire infrequently (on one estimate for the cat primary visual cortex, >90% of neurons may qualify as “silent”).135

Synthesizing evidence from a number of sources, AI Impacts offers a best guess average of 0.1-2 Hz. This sounds reasonable to me (I give most weight to the metabolic estimates). I’ll use 0.1-1 Hz, partly because Lennie (2003) treats 0.94 Hz as an overestimate, and partly because I’m mostly sticking with order-of-magnitude level precision. This suggests an overall range of ~1e13-1e15 spikes through synapses per second (1e14-1e15 synapses × 0.1-1 spikes per second).136
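Making the multiplication explicit (a sketch of the arithmetic just described, nothing more):

```python
# Sketch of the arithmetic above: spikes through synapses per second.
synapses = (1e14, 1e15)    # estimated total synapse count (range)
rate_hz = (0.1, 1.0)       # assumed average firing rate (range), Hz

low = synapses[0] * rate_hz[0]     # 1e14 * 0.1 = 1e13
high = synapses[1] * rate_hz[1]    # 1e15 * 1.0 = 1e15
print(f"{low:.0e} to {high:.0e} spikes through synapses per second")
```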

Note that many of the mechanistic method estimates reviewed in Section 1.7.1 assume a higher average spiking rate, often in the range of 100 Hz.137 For the reasons listed above, I think 100 Hz too high. ~10 Hz seems more possible (though it requires Lennie (2003) to be off by 1-2 orders of magnitude, and my best guess is lower): in that case, we’d add an order of magnitude to the high-end estimates below.

2.1.1.2 FLOPs per spike through synapse

How many FLOPs do we need to capture what matters about the signaling that occurs when a spike arrives at a synapse?

2.1.1.2.1 A simple model

A simple answer is: one FLOP. Why might one think this?

One argument is that in the context of standard neuron signaling (setting aside learning), what matters about a spike through a synapse is that it increases or decreases the post-synaptic membrane potential by a certain amount, corresponding to the synaptic weight. This could be modeled as a single addition operation (e.g., add the synaptic weight to the post-synaptic membrane potential). That is, one FLOP (of some precision, see below).138
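A minimal sketch of this picture (illustrative only; the second function anticipates the ANN-style framing discussed in the bullets just below):

```python
# Minimal sketch of the simple model above (illustrative only).

def spike_event_update(potential: float, weight: float) -> float:
    """Event-driven view: a spike through a synapse adds the synaptic
    weight to the post-synaptic membrane potential -- one FLOP."""
    return potential + weight

def rate_based_update(potential: float, weight: float, rate: float) -> float:
    """ANN-style view: multiply a non-binary firing rate by the weight,
    then accumulate -- two FLOPs, i.e. one multiply-accumulate."""
    return potential + weight * rate
```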

We can add several complications without changing this picture much:139

  • Some estimates treat a spike through a synapse as multiplication by a synaptic weight. But spikes are binary, so in a framework based on individual spikes, you’re really only “multiplying” the synaptic weight by 0 or 1 (e.g., if the neuron spikes, then multiply the weight by 1, and add it to the post-synaptic membrane potential; otherwise, multiply it by 0, and add the result – 0 – to the post-synaptic membrane potential).
  • In artificial neural networks, input neuron activations are sometimes analogized to non-binary spike rates (e.g., average numbers of spikes over some time interval), which are multiplied by synaptic weights and then summed.140 This would be two FLOPs (or one Multiply-Accumulate). But since such rates take multiple spikes to encode, this analogy plausibly suggests less than two FLOPs per spike through synapse.

How precise do these FLOPs need to be?141 That depends on the number of distinguishable synaptic weights/membrane potentials. Here are some relevant estimates:

  • Koch (1999) suggests “between 6 and 7 bits of resolution” for variables like neuron membrane potential.142
  • Bartol et al. (2015) suggest a minimum of “4.7 bits of information at each synapse” (they don’t estimate a maximum).143
  • Sandberg and Bostrom (2008) cite evidence for ~1 bit, 3-5 bits, and 0.25 bits stored at each synapse.144
  • Zador (2019) suggests “a few” bits/synapse to specify graded synaptic strengths.145
  • Lahiri and Ganguli (2013) suggest that the number of distinguishable synaptic strengths can be “as small as two”146 (though they cite Enoki et al. (2009) as indicating greater precision).147

A standard FLOP is 32 bits, and half-precision is 16 – well in excess of these estimates. Some hardware uses even lower-precision operations, which may come closer. I’d guess that 8 bits would be adequate.

If we assume 1 (8-bit) FLOP per spike through synapse, we get an overall estimate of 1e13-1e15 (8-bit) FLOP/s for synaptic transmission. I won’t continue to specify the precision I have in mind in what follows.

2.1.1.2.2 Possible complications

Here are a few complications this simple model leaves out.

Stochasticity

Real chemical synaptic transmission is stochastic. Each vesicle of neurotransmitter has a certain probability of release, conditional on a spike arriving at the synapse, resulting in variation in synaptic efficacy across trials.148 This isn’t necessarily a design defect. Noise in the brain may have benefits,149 and we know that the brain can make synapses reliable.150

Would capturing the contribution of this stochasticity to task performance require many extra FLOP/s, relative to a deterministic model? My guess is no.

  • The relevant probability distribution (a binomial distribution, according to Siegelbaum et al. (2013c), (p. 270)), appears to be fairly simple, and Dr. Paul Christiano, one of our technical advisors, thought that sampling from an approximation of such a distribution would be cheap.151
  • My background impression is that in designing systems for processing information, adding noise is easy; limiting noise is hard (though this doesn’t translate directly into a FLOPs number).
  • Despite the possible benefits of noise, my guess is that the brain’s widespread use of stochastic synapses has a lot to do with resource constraints (more reliable synapses require more neurotransmitter release sites).152
  • Many neural network models don’t include this stochasticity.153

That said, one expert I spoke with (Prof. Erik De Schutter) thought it an open question whether the brain manipulates synaptic stochasticity in computationally complex ways.154
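To illustrate why sampling this stochasticity looks cheap, here is a hedged sketch (all parameter values assumed for illustration; the binomial form follows the Siegelbaum et al. (2013c) citation above):

```python
import numpy as np

# Hedged sketch (assumed parameters): vesicle release modeled as a
# binomial draw, per the Siegelbaum et al. citation above.
rng = np.random.default_rng(0)

n_release_sites = 10   # assumed release sites at one synapse
p_release = 0.3        # assumed per-site release probability given a spike
quantal_size = 0.1     # assumed impact per released vesicle (arbitrary units)

def stochastic_psp() -> float:
    """Post-synaptic impact of one spike: one binomial sample, one multiply."""
    released = rng.binomial(n_release_sites, p_release)
    return released * quantal_size

print(stochastic_psp())   # varies by trial; mean = 10 * 0.3 * 0.1 = 0.3
```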

Synaptic conductances

The ease with which ions can flow into the post-synaptic cell at a given synapse (also known as the synaptic conductance) changes over time as the ion channels activated by synaptic transmission open and close.155 The simple “addition” model above doesn’t include this – rather, it summarizes the impact of a spike through synapse as a single, instantaneous increase or decrease to post-synaptic membrane potential.

Sarpeshkar (2010), however, appears to treat the temporal dynamics of synaptic conductances as central to the computational function of synapses.156 He assumes, as a lower bound, that “the 20 ms second-order filter response due to each synapse is 40 FLOPs,” and that such operations occur on every spike.157

I’m not sure exactly what Sarpeshkar (2010) has in mind here, but it seems plausible to me that the temporal dynamics of a neuron’s synaptic conductances can influence membrane potential, and hence spike timing, in task-relevant ways.158 One expert also emphasized the complications to neuron behavior introduced by the conductances created by a particular type of post-synaptic receptor known as the NMDA receptor – conductances that Beniaguev et al. (2020) suggest may substantially increase the complexity of a neuron’s I/O (see discussion in Section 2.1.2).159 That said, two experts thought it likely that synaptic conductances could either be summarized fairly easily or left out entirely.160
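To make the kind of dynamics at issue concrete, here is a sketch (my own illustration, with assumed time constants, and a first-order decay rather than the second-order filter Sarpeshkar describes):

```python
import math

# Illustrative sketch (assumed values): a synaptic conductance that jumps
# when a spike arrives and then decays exponentially -- a first-order
# stand-in for Sarpeshkar's second-order filter. A few FLOPs per step.
dt = 1e-3     # simulation time-step, seconds
tau = 5e-3    # assumed conductance decay time constant, seconds
decay = math.exp(-dt / tau)    # per-step decay factor (precomputable)

g = 0.0       # synaptic conductance state
trace = []
for t in range(20):            # a ~20 ms window
    if t == 0:
        g += 1.0               # spike arrives: conductance jumps
    g *= decay                 # decay between spikes (one FLOP per step)
    trace.append(g)            # the current would be g * (V - E_syn)
```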

Sparse FLOPs and time-steps per synapse

Estimates based on spikes through synapses assume that you don’t need to budget any FLOPs for when a synapse doesn’t receive a spike, but could have. Call this the “sparse FLOPs assumption.”161 In current neural network implementations, the analogous situation (e.g., artificial neuron activations of 0) creates inefficiencies, which some new hardware designs aim to avoid.162 But this seems more like an engineering challenge than a fundamental feature of the brain’s task-performance.

Note, though, that for some types of brain simulation, budgets would be based on time-steps per synapse instead, regardless of what is actually happening at the synapse over that time. Thus, for a simulation of 1e14-1e15 synapses run at 1 ms resolution (1000 time-steps per second), you’d get 1e17-1e18 synaptic time-steps per second – a number that would then be multiplied by your FLOPs budget per time-step at each synapse; and smaller time-steps would yield higher numbers. Not all brain simulations do this (see, e.g., Ananthanarayanan et al. (2009), who simulate time-steps at neurons, but events at synapses),163 but various experts use it as a default methodology.164
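Here is that arithmetic spelled out; the per-time-step cost is a placeholder, to be filled in by whatever model of the synapse one adopts:

    timesteps_per_second = 1000            # 1 ms resolution
    flops_per_timestep_per_synapse = 1     # placeholder

    for n_synapses in (1e14, 1e15):
        timesteps = n_synapses * timesteps_per_second
        flops = timesteps * flops_per_timestep_per_synapse
        print(f"{n_synapses:.0e} synapses -> {timesteps:.0e} synaptic "
              f"time-steps/s -> {flops:.0e} FLOP/s")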

Going forward, I’ll assume that on simple models of synaptic transmission where the synaptic weight is not changing during time-steps without spikes, we don’t need to budget any FLOPs for those time-steps (the budgets for different forms of synaptic plasticity are a different story, and will be covered in the learning section). If this is wrong, though, it could increase budgets by a few orders of magnitude (see Section 2.4.1).

Others

There are likely many other candidate complications that the simple model discussed above does not include. There is intricate molecular machinery located at synapses, much of which is still not well-understood. Some of this may play a role in synaptic plasticity (see Section 2.2 below), or just in maintaining a single synaptic weight (itself a substantive task), but some may be relevant to standard neuron signaling as well.165

Higher-end estimate

I’ll use 100 FLOPs per spike through synapse as a higher-end FLOP/s budget for synaptic transmission. This would at least cover Sarpeshkar’s 40 FLOP estimate, and provide some cushion for other things I might be missing, including some more complex manipulations of synaptic stochasticity.

With 1 FLOP per spike through synapse as a low end, and 100 FLOPs as a high end, we get 1e13-1e17 FLOP/s overall. Firing rate models might suggest lower numbers; other complexities and unknowns, along with estimates based on time-steps rather than spikes, higher numbers.
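Spelling out that range (using the ~1e14-1e15 synapses and ~0.1-1 Hz average firing rates that the budget above is based on):

    # Spikes through synapses per second, times FLOPs per such event.
    spike_events_per_second = (1e13, 1e15)   # 1e14-1e15 synapses at ~0.1-1 Hz
    flops_per_event = (1, 100)               # low end, high end

    low = spike_events_per_second[0] * flops_per_event[0]
    high = spike_events_per_second[1] * flops_per_event[1]
    print(f"~{low:.0e} to ~{high:.0e} FLOP/s")   # ~1e13 to ~1e17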

2.1.2 Firing decisions

The other component of standard neuron signaling is firing decisions, understood as mappings from synaptic inputs to spiking outputs.

One might initially expect firing decisions to be close to irrelevant: there are 3-4 orders of magnitude more synapses than neurons, so one might expect events at synapses to dominate the FLOP/s burden.166 But as just noted, we’re counting FLOPs at synapses based on spikes, not time-steps. Depending on the temporal resolution we use (this varies across models), the number of time-steps per second (often ≥1000) plausibly exceeds the average firing rate (~0.1-1 Hz) by 3-4 orders of magnitude as well. Thus, if we need to compute firing decisions every time-step, or just generally more frequently than the average firing rate, this could make up for the difference between neuron and synapse count (I discuss this more in Section 2.1.2.5). And firing decisions could be more complex than synaptic transmission for other reasons as well.

Neuroscientists implement firing decisions using neuron models that can vary enormously in their complexity and biological realism. Herz et al. (2006) group these models into five rough categories:167

  1. Detailed compartmental models. These attempt detailed reconstruction of a neuron’s physical structure and the electrical properties of its dendritic tree. This tree is modeled using many different “compartments” that can each have different membrane potentials.
  2. Reduced compartmental models. These include fewer distinct compartments, but still more than one.
  3. Single compartment models. These ignore the spatial structure of the neuron entirely and focus on the impact of input currents on the membrane potential in a single compartment.
    1. The Hodgkin-Huxley model, a classic model in neuroscience, is a paradigm example of a single compartment model. It models different ionic conductances in the neuron using a series of differential equations. According to Izhikevich (2004), it requires ~120 FLOPs per 0.1 ms of simulation – ~1e6 FLOP/s overall.168
    2. My understanding is that “integrate-and-fire”-type models – another classic neuron model, but much more simplified – would also fall into this category. Izhikevich (2004) suggests that these require ~5-13 FLOPs per ms per cell, or 5000-13,000 FLOP/s overall (see the sketch after this list).169
  4. Cascade models. These models abstract away from ionic conductances, and instead attempt to model a neuron’s input-output mapping using a series of higher-level linear and non-linear mathematical operations, together with sources of noise. The “neurons” used in contemporary deep learning can be seen as variants of models in this category.170 These cascade models can also incorporate operations meant to capture transformations of synaptic inputs that occur in dendrites.171
  5. Black box models. These neglect biological mechanisms altogether.
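To give a feel for why integrate-and-fire-type models are so cheap, here is one Euler step of a leaky integrate-and-fire neuron. The handful of arithmetic operations per time-step is broadly consistent with the ~5-13 FLOPs per ms figure above; the parameter values are illustrative, not drawn from Izhikevich (2004):

    def lif_step(v, i_syn, dt=1.0, tau=20.0, v_rest=-70.0,
                 v_thresh=-50.0, v_reset=-70.0):
        """One Euler step (dt in ms) of a leaky integrate-and-fire neuron."""
        v = v + (dt / tau) * (v_rest - v) + i_syn   # leak + input: ~4 FLOPs
        spiked = v >= v_thresh                      # threshold comparison
        if spiked:
            v = v_reset                             # reset after a spike
        return v, spiked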

Prof. Erik De Schutter also mentioned that greater computing power has made even more biophysically realistic models available.172 And models can in principle be arbitrarily detailed.

Which of these models (if any) would be adequate to capture what matters about firing decisions? I’ll consider four categories of evidence: the predictive success of different neuron models; some specific arguments about the computational power of dendrites; a collection of other considerations; and expert opinion/practice.

2.1.2.1 Predicting neuron behavior

Let’s first look at the success different models have had in predicting neuron spike patterns.

2.1.2.1.1 Standards of accuracy

How accurate do these predictions need to be? The question is still open.

In particular, debate in neuroscience continues about whether and when to focus on spike rates (e.g., the average number of spikes over a given period), vs. the timings of individual spikes.173

  • Many results in neuroscience focus on rates,174 as do certain neural prostheses.175
  • In some contexts, it’s fairly clear that spike timings can be temporally precise.176
  • One common argument for rates appeals to variability in a neuron’s response to repeated exposure to the same stimulus.177 My impression is that this argument is not straightforward to make rigorous, but it seems generally plausible to me that if rates are less variable than timings, they are also better suited to information-processing.178
  • A related argument is that in networks of artificial spiking neurons, adding a single spike results in very different overall behavior.179 This plausibly speaks against very precisely-timed spiking in the brain, since the brain is robust to forms of noise that can shift spike timings180 as well as to our adding spikes to biological networks.181

My current guess is that in many contexts, but not all, spike rates are sufficient.

Even if we settled this debate, though, we’d still need to know how accurately the relevant rates/timings would need to be predicted.182 Here, a basic problem is that in many cases, we don’t know what tasks a neuron is involved in performing, or what role it’s playing. So we can’t validate a model by showing that it suffices to reproduce a given neuron’s role in task-performance – the test we actually care about.183

In the absence of such validation, one approach is to try to limit the model’s prediction error to within the trial-by-trial variability exhibited by the biological neuron.184 But if you can’t identify and control all task-relevant inputs to the cell, it’s not always clear what variability is or is not task-relevant.185

Nor is it clear how much progress a given degree of predictive success represents.186 Consider an analogy with human speech. I might be able to predict many aspects of human conversation using high-level statistics about common sounds, volume variations, turn-taking, and so forth, without actually being able to replicate or generate meaningful sentences. Neuron models with some predictive success might be similarly off the mark (and similar meanings could also presumably be encoded in different ways: e.g., “hello,” “good day,” “greetings,” etc.).187

2.1.2.1.2 Existing results

With these uncertainties in mind, let’s look at some existing efforts to predict neuron spiking behavior with computational models (these are only samples from a very large literature, which I do not attempt to survey).188

Many of these come with important additional caveats:

  • Many model in vitro neuron behavior, which may differ from in vivo behavior in important ways.189
  • Some use simpler models to predict the behavior of more detailed models. But we don’t really know how good the detailed models are, either.190
  • We are very limited in our ability to collect in vivo data about the spatio-temporal input patterns at dendrites. This makes it hard to tell how models respond to realistic input patterns.191 And we know that certain behaviors (for example, dendritic non-linearities) are only triggered by specific input patterns.192
  • We can’t stimulate neurons with arbitrary input patterns. This makes it hard to test their full range of behavior.193
  • Models that predict spiking based on current injection into the soma skip whatever complexity might be involved in capturing processing that occurs in dendrites.194

A number of the results I looked at come from the retina, a thin layer of neural tissue in the eye, responsible for the first stage of visual processing. This processing is largely (though not entirely) feedforward:195 the retina receives light signals via a layer of ~100 million photoreceptor cells (rods and cones),196 processes them in two further cell layers, and sends the results to the rest of the brain via spike patterns in the optic nerve – a bundle of roughly a million axons of neurons called retinal ganglion cells.197

 

Figure 6: Diagram of the retina. From Dowling (2007), unaltered. Licensed under CC BY-SA 3.0.198

 

I focused on the retina in particular partly because it’s the subject of a prominent functional method estimate in the literature (see Section 3.1.1), and partly because it offers advantages most other neural circuits don’t: we know, broadly, what task it’s performing (initial visual processing); we know what the relevant inputs (light signals) and outputs (optic nerve spike trains) are; and we can measure/manipulate these inputs/outputs with comparative ease.199 That said, as I discuss in Section 3.1.2, it may also be an imperfect guide to the brain as a whole.

Here’s a list of various modeling results that purport to have achieved some degree of success. Most of these I haven’t investigated in detail, and I don’t have a clear sense of the significance of the quoted results. And as I discuss in later sections, some of the deep neural network models (e.g., Beniaguev et al. (2020), Maheswaranathan et al. (2019), Batty et al. (2017)) are very FLOP/s-intensive (~1e7-1e10 FLOP/s per cell).200 A more exhaustive investigation could estimate the FLOP/s costs of all the listed models, but I won’t do that here.

 

  • Beniaguev et al. (2020). Model: temporally convolutional network with 7 layers and 128 channels per layer. Predicted: spike timing and membrane potential of a detailed model of a Layer 5 cortical pyramidal cell. Stimuli: random synaptic inputs. Results: “accurately, and very efficiently, capture[s] the I/O of this neuron at the millisecond resolution … For binary spike prediction (Fig. 2D), the AUC is 0.9911. For somatic voltage prediction (Fig. 2E), the RMSE is 0.71mV and 94.6% of the variance is explained by this model”
  • Maheswaranathan et al. (2019). Model: three-layer convolutional neural network. Predicted: retinal ganglion cell (RGC) spiking in isolated salamander retina. Stimuli: naturalistic images. Results: >0.7 correlation coefficient (retinal reliability is 0.8).
  • Ujfalussy et al. (2018). Model: hierarchical cascade of linear-nonlinear subunits. Predicted: membrane potential of an in vivo-validated biophysical model of an L2/3 pyramidal cell. Stimuli: in vivo-like input patterns. Results: “Linear input integration with a single global dendritic nonlinearity achieved above 90% prediction accuracy.”
  • Batty et al. (2017). Model: shared two-layer recurrent network. Predicted: RGC spiking in isolated primate retina. Stimuli: natural images. Results: 80% of explainable variance.
  • 2016 talk (39:05) by Markus Meister. Model: linear-nonlinear. Predicted: RGC spiking (not sure of experimental details). Stimuli: naturalistic movie. Results: 80% correlation with the real response (cross-trial correlation of real responses was around 85-90%).
  • Naud et al. (2014). Model: two compartments, each modeled with a pair of non-linear differential equations and a small number of parameters that approximate the Hodgkin-Huxley equations. Predicted: in vitro spike timings of a layer 5 pyramidal cell. Stimuli: noisy current injection into the soma and apical dendrite. Results: “The predicted spike trains achieved an averaged coincidence rate of 50%. The scaled coincidence rate obtained by dividing by the intrinsic reliability (Jolivet et al. (2008a); Naud and Gerstner (2012b)) was 72%, which is comparable to the state-of-the-art performance for purely somatic current injection which reaches up to 76% (Naud et al. (2009)).”
  • Bomash et al. (2013). Model: linear-nonlinear. Predicted: RGC spiking in isolated mouse retina. Stimuli: naturalistic and artificial. Results: “the model cells carry the same amount of information,” and “the quality of the information is the same.”
  • Nirenberg and Pandarinath (2012). Model: linear-nonlinear. Predicted: RGC spiking in isolated mouse retina. Stimuli: natural scenes movie. Results: “The firing patterns … closely match those of the normal retina”; the brain would map the artificial spike trains to the same images “90% of the time.”
  • Naud and Gerstner (2012a). Model: review of a number of simplified neuron models, including the Adaptive Exponential Integrate-and-Fire (AdEx) and Spike Response Model (SRM). Predicted: in vitro spike timings of various neuron types. Stimuli: simulation of realistic in vitro conditions via injection of a fluctuating current into the soma. Results: “Performances are very close to optimal,” considering variation in real neuron responses. “For models like the AdEx or the SRM, [the percentage of predictable spikes predicted] ranged from 60% to 82% for pyramidal neurons, and from 60% to 100% for fast-spiking interneurons.”
  • Gerstner and Naud (2009). Model: threshold model. Predicted: in vivo spiking activity of a neuron in the lateral geniculate nucleus (LGN). Stimuli: visual stimulation of the retina. Results: predicted 90.5% of spiking activity.
  • Gerstner and Naud (2009). Model: integrate-and-fire model with moving threshold. Predicted: in vitro spike timings of (a) a pyramidal cell and (b) an interneuron. Stimuli: random current injection. Results: 59.6% of pyramidal cell spikes, 81.6% of interneuron spikes.
  • Song et al. (2007). Model: multi-input multi-output model. Predicted: spike trains in the CA3 region of the rat hippocampus while it was performing a memory task. Stimuli: input spike trains recorded from rat hippocampus. Results: “The model predicts CA3 output on a msec-to-msec basis according to the past history (temporal pattern) of dentate input, and it does so for essentially all known physiological dentate inputs and with approximately 95% accuracy.”
  • Pillow et al. (2005). Model: leaky integrate-and-fire model. Predicted: RGC spiking in in vitro macaque retina. Stimuli: artificial (“pseudo-random stimulus”). Results: “The fitted model predicts the detailed time structure of responses to novel stimuli, accurately capturing the interaction between the spiking history and sensory stimulus selectivity.”
  • Brette and Gerstner (2005). Model: adaptive exponential integrate-and-fire model. Predicted: spike timings of a detailed, conductance-based neuron model. Stimuli: injection of noisy synaptic conductances. Results: “Our simple model predicts correctly the timing of 96% of the spikes (+/- 2 ms)…”
  • Rauch et al. (2003). Model: integrate-and-fire model with spike-frequency-dependent adaptation/facilitation. Predicted: in vitro firing of rat neocortical pyramidal cells. Stimuli: in vivo-like noisy current injection into the soma. Results: “the integrate-and-fire model with spike-frequency-dependent adaptation/facilitation is an adequate model reduction of cortical cells when the mean spike frequency response to in vivo–like currents with stationary statistics is considered.”
  • Poirazi et al. (2003). Model: two-layer neural network. Predicted: detailed biophysical model of a pyramidal neuron. Stimuli: “an extremely varied, spatially heterogeneous set of synaptic activation patterns.” Results: 94% of variance explained (a single-layer network explained 82%).
  • Keat et al. (2001). Model: linear-nonlinear. Predicted: RGC spiking in salamander and rabbit isolated retinas, and retina/LGN spiking in anesthetized cat. Stimuli: artificial (“random flicker stimulus”). Results: “The simulated spike trains are about as close to the real spike trains as the real spike trains are across trials.”

 

Figure 7: List of some efforts to predict neuron behavior that appear to have had some amount of success.
 

What should we take away from these results? Without much understanding of the details, my high-level take-away is that some models do pretty well in some conditions. But in many cases, those conditions aren’t clearly informative about in vivo behavior across the brain; and absent better functional understanding and experimental access, it’s hard to say what level of predictive accuracy is required, in response to what types of inputs. There are also incentives to present research in an optimistic light, and contexts in which our models do much worse won’t have ended up on the list (though note, as well, that additional predictive accuracy need not require additional FLOP/s – it may be that we just haven’t found the right models yet).

Let’s look at some other considerations.

2.1.2.2 Dendritic computation

Some neuron models don’t include dendrites. Rather, they treat dendrites as directly relaying synaptic inputs to the soma.

A common objection to such models is that dendrites can do more than this.201 For example:

  • The passive membrane properties of dendrites (e.g. resistance, capacitance, and geometry) can create nonlinear interactions between synaptic inputs.202
  • Active, voltage-dependent channels can create action potentials within dendrites, some of which can backpropagate through the dendritic tree.203

Effects like these are sometimes called “dendritic computation.”204

My impression is that the importance of dendritic computation to task-performance remains somewhat unclear: many results are in vitro, and some may require specific patterns of synaptic input.205 That said, one set of in vivo measurements found very active dendrites: specifically, dendritic spike rates 5-10x larger than somatic spike rates,206 which the authors take to suggest that dendritic spiking might dominate the brain’s energy consumption.207 Energy is scarce, so if true, this would suggest that dendritic spikes are important for something. And dendritic dynamics appear to be task-relevant in a number of neural circuits.208

How many extra FLOP/s do you need to capture dendritic computation, relative to “point neuron models” that don’t include dendrites? Some considerations suggest fairly small increases:

  • A number of experts thought that models incorporating a small number of additional dendritic sub-units or compartments would likely be adequate.209
  • It may be possible to capture what matters about dendritic computation using a “point neuron” model.210
  • Some active dendritic mechanisms may function to “linearize” the impact at the soma of synaptic inputs that would otherwise decay, creating an overall result that looks more like direct current injection.211
  • Successful efforts to predict neuron responses to task-relevant inputs (e.g., retinal responses to natural movies) would cover dendritic computation automatically (though at least some prominent forms of dendritic computation don’t happen in the retina).212

Tree structure

One of Open Philanthropy’s technical advisors (Dr. Dario Amodei) also suggests a more general constraint. Many forms of dendritic computation, he suggests, essentially amount to non-linear operations performed on sums of subsets of a neuron’s synaptic inputs.213 Because dendrites are structured as a branching tree, the number of such non-linearities cannot exceed the number of inputs,214 and thus the FLOP/s cost they can impose is limited.215 Feedback created by active dendritic spiking could complicate this picture, but the tree structure will still limit communication between branches. Various experts I spoke with were sympathetic to this kind of argument,216 though one was skeptical.217

Here’s a toy illustration of this idea.218 Consider a point neuron model that adds up 1000 synaptic inputs, and then passes them through a non-linearity. To capture the role of dendrites, you might modify this model by adding, say, 10 dendritic subunits, each performing a non-linearity on the sum of 100 synaptic inputs, the outputs of which are summed at the soma and then passed through a final non-linearity (multi-layer approaches in this broad vicinity are fairly common).219

 

Figure 8: Contrasting a point neuron model with a tree-structured dendritic sub-unit model.

 

 

If we budget 1 FLOP per addition operation, and 10 per non-linearity (this is substantial overkill for certain non-linearities, like a ReLU),220 we get the following budgets:

Point neuron model:
    Soma: 1000 FLOPs (additions) + 10 FLOPs (non-linearity)
    Total: 1010 FLOPs
Sub-unit model:
    Dendrites: 10 (subunits) × (100 FLOPs (additions) + 10 FLOPs (non-linearity))
    Soma: 10 FLOPs (additions) + 10 FLOPs (non-linearity)
    Total: 1120 FLOPs

The totals aren’t that different (in general, the sub-unit model requires 11 additional FLOPs per sub-unit), even if the sub-unit model can do more interesting things. And if the tree structure caps the number of non-linearities (and hence, sub-units) at the number of inputs, then the maximum increase is a factor of ~11×.221 This story would change if, for example, subunits could be fully connected, with each receiving all synaptic inputs, or all the outputs from subunits in a previous layer. But this fits poorly with a tree-structured physiology.
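Here is the same toy budget in code; all the numbers come from the illustration above, not from empirical estimates:

    def point_neuron_flops(n_inputs, nl_cost=10):
        # Sum all inputs (1 FLOP each), then one non-linearity.
        return n_inputs + nl_cost

    def subunit_model_flops(n_inputs, n_subunits, nl_cost=10):
        # Each subunit sums its share of inputs and applies a non-linearity;
        # the soma then sums the subunit outputs and applies a final one.
        per_subunit = n_inputs // n_subunits + nl_cost
        soma = n_subunits + nl_cost
        return n_subunits * per_subunit + soma

    print(point_neuron_flops(1000))          # 1010
    print(subunit_model_flops(1000, 10))     # 1120
    print(subunit_model_flops(1000, 1000))   # 12010, i.e., roughly the ~11x cap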

Note, though, that the main upshot of this argument is that dendritic non-linearities won’t add that much computation relative to a model that budgets 1 FLOP per input connection per time-step. Our budget for synaptic transmission above, however, was based on spikes through synapses per second, not time-steps per synapse per second. In that context, if we assume that dendritic non-linearities need to be computed every time-step, then adding e.g. 100 or 1000 extra dendritic non-linearities per neuron could easily increase our FLOP/s budget by 100 or 1000x (see endnote for an example).222 That said, my impression is that many actual ANN models of dendritic computation use fewer sub-units, and it may be possible to avoid computing firing decisions/dendritic non-linearities every time-step as well – see brief discussion in Section 2.1.2.5.

Cortical neurons as deep neural networks

What about evidence for larger FLOP/s costs from dendritic computation? One interesting example is Beniaguev et al. (2020), who found that they needed a very large deep neural network (7 layers, 128 channels per layer) to accurately predict the outputs of a detailed biophysical model of a cortical neuron, once they added conductances from a particular type of receptor (NMDA receptors).223 Without these conductances, they could do it with a much smaller network (a fully connected DNN with 128 hidden units and only one hidden layer), suggesting that it’s the dynamics introduced by NMDA-conductances in particular, as opposed to the behavior of the detailed biophysical model more broadly, that make the task hard.224

This 7-layer network requires a lot of FLOPs: roughly 2e10 FLOP/s per cell.225 Scaled up by 1e11 neurons, this would be ~2e21 FLOP/s overall. And these numbers could yet be too small: perhaps you need greater temporal/spatial resolution, greater prediction accuracy, a more complex biophysical model, etc., not to mention learning and other signaling mechanisms, in order to capture what matters.

I think that this is an interesting example of positive evidence for very high FLOP/s estimates. But I don’t treat it as strong evidence on its own. This is partly out of general caution about updating on single studies (or even a few studies) I haven’t examined in depth, especially in a field as uncertain as neuroscience. But there are also a few more specific ways these numbers could be too high:

  • It may be possible to use a smaller network, given a more thorough search. Indeed, the authors suggest that this is likely, and have made data available to facilitate further efforts.226
  • They focus on predicting both membrane potential and individual spikes very precisely.
  • This is new (and thus far unpublished) work, and I’m not aware of other results of this kind.

The authors also suggest an interestingly concrete way to validate their hypothesis: namely, teach a cortical L5 pyramidal neuron to implement a function that this kind of 7-layer network can implement, such as classifying handwritten digits.227 If biological neurons can perform useful computational tasks thought to require very large neural networks to perform, this would indeed be very strong evidence for capacities exceeding what simple models countenance.228 That said, “X is needed to predict the behavior of Y” does not imply that “Y can do anything X can do” (consider, for example, a supercomputer and a hurricane).

Overall, I think that dendritic computation is probably the largest source of uncertainty about the FLOP/s costs of firing decisions. I find the Beniaguev et al. (2020) results suggestive of possible lurking complexity; but I’m also moved somewhat by the relative simplicity of some common abstract models of dendritic computation, by the tree-structure argument above, and by experts who thought dendrites unlikely to imply a substantial increase in FLOP/s.

2.1.2.3 Crabs, locusts, and other considerations

Here are some other considerations relevant to the FLOP/s costs of firing decisions.

Other experimentally accessible circuits

The retina is not the only circuit where we have (a) some sense of what task it’s performing, and (b) relatively good experimental access. Here are two others I looked at that seem amenable to simplified modeling.

  • A collection of ~30 neurons in the decapod crustacean stomach creates rhythmic firing patterns that control muscle movements. Plausibly, maintaining these rhythms is the circuit’s high-level task.229 Such rhythms can be modeled well using single-compartment, Hodgkin-Huxley-type neuron models.230 And naively, it seems to me like they could be re-implemented directly without using neuron models at all.231 What’s more, very different biophysical parameters (for example, synapse strengths and intrinsic neuron properties) result in very similar overall network behavior, suggesting that replicating task-performance does not require replicating a single set of such parameters precisely.232 That said, Prof. Eve Marder, an expert on this circuit, noted that the circuit’s biophysical mechanisms function in part to ensure smooth transitions between modes of operation – transitions that most computational models cannot capture.233
  • In a circuit involved in locust collision avoidance, low-level biophysical dynamics in the dendrites and cell body of a task-relevant neuron are thought to implement high-level mathematical operations (logarithm, multiplication, addition) that a computational model could replicate directly.234

I expect that further examination of the literature would reveal other examples in this vein.235

Selection effects

Neuroscientific success stories might be subject to selection effects.236 For example, the inference “A, B, and C can be captured with simple models, therefore probably X, Y, and Z can too” is bad if the reason X, Y, and Z haven’t yet been so captured is that they can’t be.

However, other explanations may also be available. For example, it seems plausible to me we’ve had more success in peripheral sensory/motor systems than deeper in the cortex because of differences in the ease with which task-relevant inputs and outputs can be identified, measured, and manipulated, rather than differences in the computation required to run adequate models of neurons in those areas.237 And FLOP/s requirements do not seem to be the major barrier to e.g. C. elegans simulation.238

Evolutionary history

Two experts (one physicist, one neuroscientist) mentioned the evolutionary history of neurons as a reason to think that they don’t implement extremely complex computations. The basic thought here seemed to be something like: (a) neurons early in evolutionary history seem likely to have been doing something very simple (e.g., basic stimulus-response behavior), (b) we should expect evolution to tweak and recombine these relatively simple components, rather than to add a lot of complex computation internal to the cells, and (c) indeed, neurons in the human brain don’t seem that different from neurons in very simple organisms.239 I haven’t looked into this, but it seems like an interesting angle.240

Communication bottlenecks

A number of experts mentioned limitations on the bits that a neuron receives as input and sends as output (limitations imposed by e.g. firing precision, the number of distinguishable synaptic states, etc.) as suggestive of a relatively simple input-output mapping.241

I’m not sure exactly how this argument works (though I discuss one possibility in the communication method section). In theory, very large amounts of computation can be required to map a relatively small number of possible inputs (e.g., the product of two primes, a boolean formula) to a small number of possible outputs (e.g., the prime factors, a bit indicating whether the formula is satisfiable).242 For example, RSA-240 is ~800 bits (if we assume 1000-10,000 input synapses, each receiving 1 spike/s in 1 of 1000 bins, a neuron would be receiving ~10-100k bits/s),243 but it took ~900 core-years on a 2.1 GHz CPU to factor.244 And the bits that the human brain as a whole receives and outputs may also be quite limited relative to the complexity of its information-processing (Prof. Markus Meister suggested ~10-40 bits per second for various motor outputs).245

Of course, naively, neurons (indeed, brains) don’t seem to be factorizing integers. Indeed, in general, I think this may well be a good argument, and I welcome attempts to make it more explicit and quantified. Suppose, for example, that a neuron receives ~10-100k bits/s and outputs ~10 bits/s. What would this suggest about the FLOP/s required to reproduce the mapping, and why?
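For reference, here is the back-of-envelope behind the input figure in that example (assuming, per the text above, ~1 spike/s per synapse, with timing resolved into 1 of 1000 bins):

    import math

    bits_per_spike = math.log2(1000)   # ~10 bits of timing information per spike
    for n_synapses in (1_000, 10_000):
        print(f"{n_synapses} synapses -> ~{n_synapses * bits_per_spike:,.0f} bits/s")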

Ability to replicate known types of neuron behavior

According to Izhikevich (2004), some neuron models, such as simple integrate-and-fire models, can’t replicate known types of neuron behaviors, some of which (like adaptations in spike frequency over time, and spike delays that depend on the strength of the inputs)246 seem to me plausibly important to task-performance:247

 

Figure 9: Diagram of which behaviors different models can capture. © 2004 IEEE. Reprinted, with permission, from Izhikevich, Eugene. “Which model to use for cortical spiking neurons?”. IEEE Transactions on Neural Networks, Vol. 15, No. 5, 2004. Original caption: “Comparison of the neuro-computational properties of spiking and bursting models; see Fig. 1. ‘#of FLOPS’ is an approximate number of floating point operations (addition, multiplication, etc.) needed to simulate the model during a 1 ms time span. Each empty square indicates the property that the model should exhibit in principle (in theory) if the parameters are chosen appropriately, but the author failed to find the parameters within a reasonable period of time.”

Note, though, that Izhikevich suggests that his own model can capture these behaviors, for 13 FLOPs per ms.

Simplifying the Hodgkin-Huxley model

Some experts argue that the Hodgkin-Huxley model can be simplified:

  • Prof. Dong Song noted that the functional impacts of its ion channel dynamics are highly redundant, suggesting that you can replicate the same behavior with fewer equations.248
  • Izhikevich (2003) claims that “[His simplified neuron model] consists of only two equations and has only one nonlinear term, i.e., v². Yet … the difference between it and a whole class of biophysically detailed and accurate Hodgkin–Huxley-type models, including those consisting of enormous number of equations and taking into account all possible information about ionic currents, is just a matter of coordinate change.”249
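For reference, here is a sketch of the two-equation model in question, as a plain Euler step. The a, b, c, d values are the regular-spiking defaults from Izhikevich (2003), and the reset rule is from the same paper; his published code integrates v in two half-steps per ms, which I omit here for simplicity:

    def izhikevich_step(v, u, i_in, a=0.02, b=0.2, c=-65.0, d=8.0, dt=1.0):
        """One dt-ms Euler step of the Izhikevich (2003) neuron model."""
        v = v + dt * (0.04 * v * v + 5.0 * v + 140.0 - u + i_in)
        u = u + dt * a * (b * v - u)
        if v >= 30.0:          # spike peak reached
            v, u = c, u + d    # reset rule from the paper
        return v, u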

ANNs and interchangeable non-linearities

Artificial neural networks (ANNs) have led to breakthroughs in AI, and we know they can perform very complex tasks.250 Yet the individual neuron-like units are very simple: they sum weighted inputs, and their “firing decisions” are simple non-linear operations, like a ReLU.251

The success of ANNs is quite compatible with biological neurons doing something very different. And comparisons between brains and exciting computational paradigms can be over-eager.252 Still, knowing that ANN-like units are useful computational building-blocks makes salient the possibility that biological neurons are useful for similar reasons. Alternative models, including ones that incorporate biophysical complications that ANNs ignore, cannot boast similar practical success.

What’s more, the non-linear operations used in artificial neurons are, at least to some extent, interchangeable.253 That is, instead of a ReLU, you can use e.g., a sigmoid (though different operations have different pros and cons). If we pursue the analogy with firing decisions, this interchangeability might suggest that the detailed dynamics that give rise to spiking are less important than the basic function of passing synaptic inputs through some non-linearity or other.
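As a trivial illustration of this interchangeability (the weights and inputs here are arbitrary):

    import numpy as np

    def relu(z):
        return np.maximum(0.0, z)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    x = np.array([0.5, -1.0, 2.0])   # "synaptic inputs"
    w = np.array([0.1, 0.4, -0.3])   # "synaptic weights"
    z = w @ x                        # weighted sum
    print(relu(z), sigmoid(z))       # either can serve as the "firing decision"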

On a recent podcast, Dr. Matthew Botvinick also mentions a chain of results going back to the 1980s showing that the activity in the units of task-trained deep learning systems bears strong resemblance to the activity of neurons deep in the brain. I discuss a few recent visual cortex results in this vein in Section 3.2, and note a few other recent results in Section 3.3.254 Insofar as a much broader set of results in this vein is available, that seems like relevant evidence as well.

Intuitive usefulness

One of our technical advisors, Dr. Paul Christiano, noted that from a computer science perspective, the Hodgkin-Huxley model just doesn’t look very useful. That is, it’s difficult to describe any function for which (a) this model is a useful computational building block, and (b) its usefulness arises from some property it has that simpler computational building blocks don’t.255 Perhaps something similar could be said of even more detailed biophysical models.

Note, though, advocates of large compute burdens need not argue that actual biophysical models themselves are strictly necessary; rather, they need only argue for the overall complexity of a neuron’s input-output transformation.

Noise bounds

Various experts suggest that noise in the brain may provide an upper bound on the compute required to do what it does.256 However, I’m not sure how to identify this bound, and haven’t tried.

2.1.2.4 Expert opinion and practice

There is no consensus in neuroscience about what models suffice to capture task-relevant neuron behavior.257

A number of experts indicated that in practice, the field’s emphasis is currently on comparatively simple models, rather than on detailed modeling.258 But this evidence is indirect. After all, the central question a neuroscientist needs to ask is not (a) “what model is sufficient, in principle, to replicate task-relevant behavior?”, but rather (b) “what model will best serve the type of neuroscientific understanding I am aiming to advance, given my constraints?”.

Indeed, much discussion of model complexity is practical: it is often said that biophysical models are difficult to compute, fit to data, and understand; that simpler models, while better on these fronts, come at the cost of biological realism; and that the model you need depends on the problem at hand.259 Thus, answers to (a) and (b) can come apart: you can think that ultimately, we’ll need complex models, but that simpler ones are more useful given present constraints; or that ultimately, simplifications are possible, but detailed modeling is required to identify them.260

Still, some experts answer (a) explicitly. In particular:

  • A number of experts I spoke to expected comparatively simple models (e.g., simpler than Hodgkin-Huxley) to be adequate.261 I expect many computational neuroscientists who have formed opinions on the topic (as opposed to remaining agnostic) to share this view.262
  • Various experts suggest that some more detailed biophysical models are adequate.263
  • In an informal poll of participants at a 2007 workshop on Whole Brain Emulation, the consensus appeared to be that a level of detail somewhere between a “spiking neural network” and the “metabolome” would be adequate (strong selection effects likely influenced who was present).264

A number of other experts I spoke with expressed more uncertainty, agnosticism, and sympathy towards higher end estimates.265 And many (regardless of specific opinion) suggested that views about this topic (including, sometimes, their own) can emerge in part from gut feeling, a desire for one’s own research to be important/tractable, and/or from the tradition and assumptions one was trained in.266

 

2.1.2.5 Overall FLOP/s for firing decisions

Where does this leave us in terms of overall FLOP/s for firing decisions? Here’s a chart with some examples of possible levels of complexity, scaled up to the brain as a whole:

 

Figure 10: FLOP/s budgets for different models of neuron firing decisions
  • ReLU: 1 FLOP per operation;267 10 ms time-steps;268 1e13 FLOP/s for 1e11 neurons.
  • Izhikevich spiking neuron model: 13 FLOPs per ms;269 1 ms time-steps;270 ~1e15 FLOP/s for 1e11 neurons.
  • Single-compartment Hodgkin-Huxley model: 120 FLOPs per 0.1 ms;271 0.1 ms time-steps;272 ~1e17 FLOP/s for 1e11 neurons.
  • Beniaguev et al. (2020) DNN: 1e7 FLOPs per ms;273 1 ms time-steps; ~1e21 FLOP/s for 1e11 neurons.
  • Hay et al. (2011) detailed L5PC model: ~1e10 FLOPs per ms?;274 time-step unknown; ~1e24 FLOP/s for 1e11 neurons?
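The scaling arithmetic behind this chart is just FLOPs per time-step, times time-steps per second, times 1e11 neurons:

    models = {
        "ReLU":                 (1,   10.0),   # FLOPs per step, step size in ms
        "Izhikevich":           (13,  1.0),
        "Hodgkin-Huxley":       (120, 0.1),
        "Beniaguev et al. DNN": (1e7, 1.0),
    }
    for name, (flops_per_step, step_ms) in models.items():
        flop_s = flops_per_step * (1000.0 / step_ms) * 1e11
        print(f"{name}: ~{flop_s:.0e} FLOP/s")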

 

Even the lower-end numbers here are competitive with the budgets for synaptic transmission above (1e13-1e17 FLOP/s). This might seem surprising, given the difference in synapse and neuron count. But as I noted at the beginning of the section, the budgets for synaptic transmission were based on average firing rates; whereas I’m here assuming that firing decisions must be computed once per time-step (for some given size of time-step).275

This assumption may be mistaken. Dr. Paul Christiano, for example, suggested that it would be possible to accumulate inputs over some set of time-steps, then calculate what the output spike pattern would have been over that period.276 And Sarpeshkar (2010) appears to assume that the FLOP/s he budgets for firing decisions (enough for 1 ms of Hodgkin-Huxley model) need only be used every time the neuron spikes.277 If something like this is true, the numbers would be lower.

Other caveats:

  • I’m leaning heavily on the FLOPs estimates in Izhikevich (2004), which I haven’t verified.
  • Actual computation burdens for running e.g. a Hodgkin-Huxley model depend on implementation details like platform, programming language, integration method, etc.278
  • In at least some conditions, simulations of integrate-and-fire neurons can require very fine-grained temporal resolution (e.g., 0.001 ms) to capture various properties of network behavior.279 Temporal resolutions like this would increase the numbers above considerably. However, various other simulations using simplified spiking neuron models, such as the leaky integrate-and-fire simulations run by Prof. Chris Eliasmith (which actually perform tasks like recognizing numbers and predicting sequences of them), use lower resolutions.280
  • The estimate above for Hay et al. (2011) is especially rough.281
  • The high end of this chart is not an upper bound on modeling complexity. Biophysical modeling can in principle be arbitrarily detailed.

Overall, my best guess is that the computation required to run single-compartment Hodgkin-Huxley models of every neuron in the brain (1e17 FLOP/S, on the estimate above) is overkill for capturing the task-relevant dimensions of firing decisions. This is centrally because:

  • Efforts to predict neuron behavior using simpler models (including simplified models of dendritic computation) appear to have had a decent amount of success (though these results also have many limitations, and I’m not in a great position to evaluate them).
  • With the exception of Beniaguev et al. (2020), I don’t see much positive evidence that dendritic computation alters this picture dramatically.
  • I find some of the considerations canvassed in Section 2.1.2.3 (other simple circuits; the success of ANNs with simple, interchangeable non-linearities) suggestive; and I think that others I don’t understand very well (e.g., communication bottlenecks, mathematical results showing that the Hodgkin-Huxley equations can be simplified) may well be quite persuasive on further investigation.
  • My impression is that a substantial fraction (maybe a majority?) of computational neuroscientists who have formed positive opinions about the topic (as opposed to remaining agnostic) would also think that single-compartment Hodgkin-Huxley is overkill for capturing task-performance (though it may be helpful for other forms of neuroscientific understanding).

Thus, I’ll use 1e17 FLOP/s as a high-end estimate for firing decisions.

The Izhikevich spiking neuron model estimate (1e15 FLOP/s) seems to me like a decent default estimate, as it can capture more behaviors than a simple integrate-and-fire model, for roughly comparable FLOP/s (indeed, Izhikevich seems to argue that it can do anything a Hodgkin-Huxley model can). And if simpler operations (e.g., a ReLU) and/or lower time resolutions are adequate, we’d drop to something like 1e13 FLOP/s, possibly lower. I’ll use 1e13 FLOP/s as a low end, leaving us with an overall range similar to the range for synaptic transmission: 1e13 to 1e17 FLOP/s.

2.2 Learning

Thus far, we have been treating the synaptic weights and firing decision mappings as static over time. In reality, though, experience shapes neural signaling in a manner that improves task performance and stores task-relevant information. I’ll call these changes “learning.”

Some of these may proceed via standard neuron signaling (for example, perhaps firing patterns in networks with static weights could store short-term memories).282 But the budgets thus far already cover this. Here I’ll focus on processes that we haven’t yet covered, but which are thought to be involved in learning. These include:

  • Synaptic weights change over time (“synaptic plasticity”). These changes are often divided into categories:
    • Short-term plasticity (e.g., changes lasting from hundreds of milliseconds to a few seconds).
    • Long-term plasticity (changes lasting longer).283
  • The type of synaptic plasticity neurons exhibit can itself change (“meta-plasticity”).
  • The electric properties of the neurons (for example, ion channel expression, spike threshold, resting membrane potential) also change (“intrinsic plasticity”).284
  • New neurons, synapses, and dendritic spines grow over time, and old ones die.285

Such changes can be influenced by many factors, including pre-synaptic and post-synaptic spiking,286 receptor activity in the post-synaptic dendrite,287 the presence or absence of various neuromodulators,288 interactions with glial cells,289 chemical signals from the post-synaptic neuron to the pre-synaptic neuron,290 and gene expression.291 There is a lot of intricate molecular machinery plausibly involved,292 which we don’t understand well and which can be hard to access experimentally293 (though some recent learning models attempt to incorporate it).294 And other changes in the brain could be relevant as well.295

Of course, many tasks (say, tying your shoes) don’t require much learning, once you know how to do them. And many tasks are over before some of the mechanisms above have had time to have effects, suggesting that such mechanisms can be left out of FLOP/s budgets for those tasks.296

But learning to perform new tasks, sometimes over long timescales, is itself a task that the brain can perform. So a FLOP/s estimate for any task that the brain can perform needs to budget FLOP/s for all forms of learning.

How many FLOP/s? Here are a few considerations.

2.2.1 Timescales

Some of the changes involved in learning occur less frequently than spikes through synapses. Growing new neurons, synapses, and dendritic spines is an extreme example. At a glance, the number of new neurons per day in adult humans appears to be on the order of hundreds or less;297 and Zuo et al. (2005) report that over two weeks, only 3%-5% of dendritic spines in adult mice were eliminated and formed (though Prof. Erik De Schutter noted that networks of neurons can rewire themselves over tens of minutes).298 Because these events are so comparatively rare, I expect modeling their role in task-performance to be quite cheap relative to e.g. 1e14 spikes through synapses/sec.299 This holds even if the number of FLOPs required per event is very large, which I don’t see strong reason to expect.

Something similar may apply to some other types of changes to e.g. synaptic weights and intrinsic neuron properties:

  • Some long-term changes require building new biochemical machinery (receptors, ion channels, etc.), which seems resource-intensive relative to e.g. synaptic transmission (though I don’t have numbers here).300 This suggests limitations on frequency.
  • If a given type of change lasts a long time in vivo (and hence, is not “reset” very frequently) or is triggered primarily by relatively rare events (e.g., sustained periods of high-frequency pre-synaptic spiking), this could also suggest such limitations.301
  • It seems plausible that some amount of stability is required for long-term information storage.302

More generally, some biochemical mechanisms involved in learning are relatively slow-moving. The signaling cascades triggered by some neuromodulators, for example, are limited by the speed of chemical diffusion, which Koch (1999) suggests extends their timescales to seconds or longer;303 Bhalla (2014) characterizes various types of chemical computation within synapses as occurring on timescales of seconds;304 and Yap and Greenberg (2018) characterize gene transcription taking place over minutes as “rapid.”305 This too might suggest limits on required FLOP/s.

I discuss arguments that appeal to timescales in more detail in Section 2.3. As I note there, I don’t think these arguments are conceptually airtight, but I find them suggestive nonetheless, and I expect them to apply to many processes involved in learning.

That said, the frequency with which a given change occurs does not necessarily limit the frequency with which biophysical variables involved in the process need to be updated, or decisions made about what changes to implement as a result.306 What’s more, some forms of synaptic plasticity occur on short timescales, reflecting rapid changes in e.g. calcium or neurotransmitter in a synapse;307 and Bhalla (2014) notes that spike-timing dependent plasticity “requires sharp temporal discrimination of the order of a few milliseconds” (p. 32).

2.2.2 Existing models

There is no consensus model for how the brain learns,308 and the training required to create state of the art AI systems seems in various ways comparatively inefficient.309 There is debate over comparisons with learning algorithms like backpropagation310 (along with meta-debate about whether this debate is meaningful or worthwhile).311

Still, different models can at least serve as examples of possible FLOP/s costs. Here are a few that came up in my research.

Figure 11: Some example learning models
  • Hebbian rules. Description: classic set of models; a synapse strengthens or weakens as a function of pre-synaptic spiking and post-synaptic spiking, possibly together with some sort of external modulation/reward.312 FLOP/s costs: 3-5 FLOPs per synaptic update?313 Expert opinion: Prof. Anthony Zador expected the general outlines to be correct.314 Prof. Chris Eliasmith uses a variant in his models.315
  • Benna and Fusi (2016). Description: models synapses as a dynamical system of variables interacting on multiple timescales; may help resolve the “stability-plasticity dilemma,” on which overly plastic synapses are too easily overwritten, but overly rigid synapses are unable to learn; may also help with online learning. FLOP/s costs: ~2-30x the FLOPs to run a model with one parameter per synapse? (very uncertain)316 Expert opinion: some experts argue that shifting to synaptic models of this kind, involving dynamical interactions, is both theoretically necessary and biologically plausible.317
  • First-order gradient descent methods. Description: use the slope of the loss function to minimize the loss;318 in widespread use in machine learning; contentious debate about biological plausibility. FLOP/s costs: ~2x a static network; the learning step is basically a backwards pass through the network, and going forward and backward come at roughly the same cost.319 Expert opinion: Prof. Konrad Kording, Prof. Barak Pearlmutter, and Prof. Blake Richards favored estimates based on this anchor/in this range of FLOP/s costs.320
  • Second-order gradient descent methods. Description: take into account not just the slope of the loss function, but also the curvature; arguably better than first-order gradient descent methods, but require more compute, so used more rarely.321 FLOP/s costs: large; compute per learning step scales as a polynomial with the number of neurons and synapses in a network.322 Expert opinion: Dr. Paul Christiano thought it very implausible that the brain implements a rule of this kind.323 Dr. Adam Marblestone had not seen any proposals in this vein.324
  • Node-perturbation algorithms. Description: involve keeping/consolidating random changes to the network that result in reward, and getting rid of changes that result in punishment; as the size of a network grows, these take longer to converge than first-order gradient methods.325 FLOP/s costs: <2x a static network (e.g., less than first-order gradient descent methods).326 Expert opinion: Prof. Blake Richards thought that humans learn with less data than this kind of algorithm would require.327
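To make the first row concrete, here is a minimal sketch of a reward-modulated Hebbian update; the exact form and learning rate are illustrative assumptions, not a claim about what the brain does:

    import numpy as np

    def hebbian_update(w, pre, post, reward=1.0, lr=0.01):
        """w: weight matrix; pre/post: recent pre-/post-synaptic activity."""
        # One multiply-accumulate plus modulation per synapse: a few FLOPs,
        # in the spirit of the 3-5 FLOPs per synaptic update figure above.
        return w + lr * reward * np.outer(post, pre)

    w = np.zeros((2, 3))   # 2 post-synaptic units, 3 pre-synaptic units
    w = hebbian_update(w, pre=np.array([1.0, 0.0, 1.0]),
                       post=np.array([0.0, 1.0]))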

Caveats:

  • This is far from an exhaustive list.328
  • The brain may be learning in a manner quite dissimilar from any known learning models. After all, it succeeds in learning in ways we can’t replicate with artificial systems.
  • I haven’t investigated these models much: the text and estimates above are based primarily on comments from experts (see endnotes for citations). With more time and expertise, it seems fairly straightforward to generate better FLOP/s estimates.
  • Synaptic weights are often treated as the core learned parameters in the brain,329 but alternative views are available. For example, Prof. Konrad Kording suggested that the brain could be optimizing ion channels as well (there are considerably more ion channels than synapses).330 Thus, the factor increase for learning need not be relative to a static model based on synapses.
  • As noted above, some of what we think of as learning and memory may be implemented via standard neuron signaling, rather than via modifications to e.g. synaptic weights/firing decisions.

With that said, a number of these examples seem to suggest relatively small factor increases for learning, relative to some static baseline (though what that baseline should be is a further question). Second-order gradient methods would cost more than this, but I have yet to hear anyone argue that the brain uses them, or propose a biological implementation. And node-perturbation would cost less (though it may require more data than humans use).

2.2.3 Energy costs

If we think that FLOP/s costs correlate with energy expenditure in the brain, we might be able to estimate the FLOP/s costs of learning via the energy spent on it. For example, Lennie (2003) estimates that >50% of the total energy in the neocortex goes to processes involved in standard neuron signaling – namely, maintaining resting potentials in neurons (28%), reversing Na+ and K+ fluxes from spikes (13%), and spiking itself (13%).331 That would leave <50% for (a) learning processes beyond this and (b) everything else (maintaining glial resting potentials is another 10%). Very naively, this might suggest less than a 2× factor for learning, relative to standard neuron signaling.

Should we expect FLOP/s costs to correlate with energy expenditure? Generally speaking, larger amounts of information-processing take more energy, so the thought seems at least suggestive (e.g., it’s somewhat surprising if the part of your computer doing 99% of the information-processing is using less than half the energy).332 In the context of biophysical modeling, though, it’s less obvious, as depending on the level of detail in question, modeling systems that use very little energy can be very FLOP/s intensive.

2.2.4 Expert opinion

A number of experts were sympathetic to FLOP/s budgets for learning in the range of 1-100 FLOPs per spike through synapse.333 Some of this sympathy was based on using (a) Hebbian models, or (b) first-order gradient descent models as an anchor.

Sarpeshkar (2010) budgets at least 10 FLOPs per spike through synapse for synaptic learning.334 Other experts expressed agnosticism and/or openness to much higher numbers;335 and one (Prof. Konrad Kording) argued for estimates based on ion-channel plasticity, rather than synaptic plasticity.336

2.2.5 Overall FLOP/s for learning

Of the many uncertainties afflicting the mechanistic method, the FLOP/s required to capture learning seems to me like one of the largest. Still, based on the timescales, algorithmic anchors, energy costs, and expert opinions just discussed, my best guess is that learning does not push us outside the range already budgeted for synaptic transmission: e.g., 1-100 FLOPs per spike through synapse.

  • Learning might well be in the noise relative to synaptic transmission, due to the timescales involved.
  • 1-10 FLOPs per spike through synapse would cover various estimates for short-term synaptic plasticity and Hebbian plasticity; along with factors of 2× or so (à la first-order gradient descent anchors, or the run-time slow-down in Kaplanis et al. (2018)) on top of lower-end synaptic transmission estimates.
  • 100 FLOPs per spike through synapse would cover the higher-end Benna-Fusi estimate above (though this was very loose), as well as some cushion for other complexities.

To me, the most salient route to higher numbers uses something other than spikes through synapses as a baseline. For example, if we used time-steps per second at synapses instead, with 1 ms time-steps, then X FLOPs per time-step per synapse for learning would imply X × 1e17-1e18 FLOP/s (assuming 1e14-1e15 synapses). Treating learning costs as scaling with ion channel dynamics (à la Prof. Konrad Kording’s suggestion), or as a multiplier on higher-end standard neuron signaling estimates, would also yield higher numbers.

I could also imagine being persuaded by arguments of roughly the form: “A, B, and C simple models of learning lead to X theoretical problems (e.g., catastrophic forgetting), which D more complex model solves in a biologically plausible way.” Such an argument motivates the model in Benna and Fusi (2016), which boasts some actual usefulness to task-performance to boot (e.g. Kaplanis et al. (2018)). There may be other models with similar credentials, but higher FLOP/s costs.

I don’t, though, see our ignorance about how the brain learns as a strong positive reason, just on its own, to think larger budgets are required. It’s true that we don’t know enough to rule out such requirements. But “we can’t rule out X” does not imply “X should be our best guess.”

2.3 Other signaling mechanisms

Let’s turn to other signaling mechanisms in the brain. There are a variety. They tend to receive less attention than standard neuron signaling, but some clearly play a role in task-performance, and others might.

Our question, though, is not whether these mechanisms matter. Our question is whether they meaningfully increase a FLOP/s budget that already covers standard neuron signaling and learning.337

As a preview: my best guess is that they don’t. This is mostly because:

  1. My impression is that most experts who have formed opinions on the topic (as opposed to remaining agnostic) do not expect these mechanisms to account for the bulk of the brain’s information-processing, even if they play an important role.338
  2. Relative to standard neuron signaling, each of the mechanisms I consider is some combination of (a) slower, (b) less spatially-precise, (c) less common in the brain (or, not substantially more common), or (d) less clearly relevant to task-performance.

But of course, familiar caveats apply: there’s a lot we don’t know, experts might be wrong (and/or may not have given this issue much attention), and the arguments aren’t conclusive.

Arguments related to (a)-(d) will come up a few times in this section, so it’s worth a few general comments about them up front.

Speed

If a signaling mechanism X involves slower-moving elements, or processes that take longer to have effects, than another mechanism Y, does this suggest a lower FLOP/s budget for X, relative to Y? Heuristically, and other things equal: yes, at least to my mind. That is, naively, it seems harder to perform lots of complex, useful information-processing per second using slower elements/processes (computers using such elements, for example, are less powerful). And various experts seemed to take considerations in this vein quite seriously.339

That said, other things may not be equal. X signals might be sent more frequently, as a result of more complex decision-making, with more complex effects, etc.340 What’s more, the details of actually measuring and modeling different timescales in the brain may complicate arguments that appeal to them. For example, Prof. Eve Marder noted that traditional views about timescale separations in neuroscience emerge in part from experimental and computational constraints: in reality, slow processes and fast processes interact.341

It’s also generally worth distinguishing between different lengths of time that can be relevant to a given signaling process, including:

  • How long it takes to trigger the sending of a signal X.
  • How long it takes for a signal X to reach its target Y.
  • How long it takes for X’s reaching Y to have effect Z.
  • How frequently signals X are sent.
  • How long effect Z can last.
  • How long effect Z does in fact last in vivo.

These can feed into different arguments in different ways. I’ll generally focus on the first three.

Spatial precision

If a signaling mechanism X is less spatially precise than another mechanism Y (e.g., signals arise from the combined activities of many cells, and/or affect groups of cells, rather than being targeted at individual cells), does this suggest lower FLOP/s budgets for X, relative to Y? Again: heuristically, and other things equal, I think it does. That is, naively, units that can send and receive individualized messages seem to me better equipped to implement more complex information-processing per unit volume. And various experts took spatial precision as an important indicator of FLOP/s burdens as well.342 Again, though, there is no conceptual necessity here: X might nevertheless be very complex, widespread, etc. relative to Y.

Number/frequency

If X is less common than Y, or happens less frequently, this seems to me a fairly straightforward pro tanto reason to budget fewer FLOP/s for it. I’ll treat it as such, even though clearly, it’s no guarantee.

Task-relevance

The central role of standard neuron signaling in task-performance is well established. For many of these alternative signaling mechanisms, though, the case is weaker. Showing that something can be made to happen in a petri dish, for example, is different from showing that it happens in vivo and matters to task-performance (let alone that it implies a larger FLOP/s budget than standard neuron signaling). Of course, in some cases, if something did happen in vivo and matter to task-performance, we couldn’t easily tell. But I won’t, on these grounds, assume that every candidate for such a role plays it.

Let’s look at the mechanisms themselves.

2.3.1 Other chemical signals

The brain employs many chemical signals other than the neurotransmitters involved in standard neuron signaling. For example:

  • Neurons release larger molecules known as neuropeptides, which diffuse through the space between cells.343
  • Neurons produce gases like nitric oxide and carbon monoxide, as well as lipids known as endocannabinoids, both of which can pass directly through the cell membrane.344

Chemicals that neurons release that regulate the activity of groups of neurons (or other cells) are known as neuromodulators.345

Chemical signals other than classical neurotransmitters are very common in the brain,346 and very clearly involved in task performance.347 For example, they can alter the input-output function of individual neurons and neural circuits.348

However, some considerations suggest limited FLOP/s budgets, relative to standard neuron signaling:

  • Speed: Signals that travel through the extracellular space are limited by the speed of chemical diffusion, and some travel distances much longer than a 20 nm synaptic cleft.349 What’s more, nearly all neuropeptides act via metabotropic receptors, which take longer to have effects on a cell than the ionotropic receptors involved in standard neuron signaling.350
  • Spatial precision: Some (maybe most?) of these chemical signals act on groups of cells. As Leng and Ludwig (2008) put it: “peptides are public announcements … they are messages not from one cell to another, but from one population of neurones to another.”351
  • Frequency: Neuropeptides are released less frequently than classical neurotransmitters. For example, Leng and Ludwig (2008) suggest that the release of a vesicle containing neuropeptide requires “several hundred spikes,” and that oxytocin is released at a rate of “1 vesicle per cell every few seconds.”352 This may be partly due to resource constraints (neuropeptides, unlike classic neurotransmitters, are not recycled).353
  • Because neuromodulators play a key role in plasticity, some of their contributions may already fall under the budget for learning.

This is a coarse-grained picture of a very diverse set of chemical signals, some of which may not be so e.g. slow, imprecise, or infrequent. Still, a number of experts treat these properties as reasons to think that the FLOP/s for chemical signaling beyond standard neuron signaling would not add much to the budget.354

2.3.2 Glia

Neurons are not the only brain cells. Non-neuron cells known as glia have traditionally been thought to mostly act to support brain function, but there is evidence that they can play a role in information-processing as well.355

This evidence appears to be strongest with respect to astrocytes, a star-shaped type of glial cell that extends thin arms (“processes”) to enfold blood vessels and synapses.

  • Mu et al. (2019) suggest that zebrafish astrocytes “perform a computation critical for behavior: they accumulate evidence that current actions are ineffective and consequently drive changes in behavioral states.”356
  • Astrocytes exhibit a variety of receptors, activation of which leads to increases in the concentration of calcium within the cell and consequently the release of transmitters.357
  • Changes in calcium concentrations can propagate across networks of astrocytes (a calcium “wave”), enabling a form of signaling over longer distances.358 Sodium dynamics appear to play a signaling role as well.359
  • Astrocytes can also signal to neurons by influencing concentrations of ions or neurotransmitters in the space between cells.360 They can regulate neuron activity; a variety of mechanisms exist via which they can influence short-term plasticity; and they are involved in both long-term plasticity and the development of new synapses.361
  • Human astrocytes also appear to be larger, and to exhibit more processes, than those of rodents, which has led to speculation that they play a role in explaining the human brain’s processing power.362

Other glia may engage in signaling as well. For example:

  • NG2 protein-expressing oligodendrocyte progenitor cells can receive synaptic input from neurons, form action potentials, and regulate synaptic transmission between neurons.363
  • Glial cells involved in the creation of myelin (the insulated sheath that surrounds axons) can detect and respond to axonal activity.364

Would FLOP/s for the role of glia in task-performance meaningfully increase our budget? Here are some considerations:

  • Speed: Astrocytes can respond to neuronal events within hundreds of milliseconds,365 and they can detect individual synaptic events.366 However, the timescales of other astrocyte calcium dynamics are thought to be slower (on the order of seconds or more), and some effects require sustained stimulation.367
  • Spatial resolution: Previous work assumed that astrocyte calcium signaling could not be spatially localized to e.g. a specific cellular compartment, but this appears to be incorrect.368
  • Number: The best counting methods available suggest that the ratio of glia to neurons in the brain is roughly 1:1 (it was previously thought to be 10:1, but this appears to be incorrect).369 This ratio varies across regions of the brain (in the cerebral cortex, it’s about 3:1).370 Astrocytes appear to be about 20-40% of glia (though these numbers may be questionable);371 and NG2 protein-expressing oligodendrocyte progenitor cells discussed above are only 2-8% of the total cells in the cortex.372 If the average FLOP/s cost per glial cell were the same as the average per neuron, this would likely less than double our budget.373 That said, astrocytes may have more connections to other cells, on average, than neurons.374
  • Energy costs: Neurons consume the majority of the brain’s energy. Zhu et al. (2012) estimate that “a non-neuronal cell only utilizes approximately 3% of that [energy] used by a neuron in the human brain” – a ratio which they take to suggest that neurons account for 96% of the energy expenditure in human cortical grey matter, and 68% in white matter.375 Attwell and Laughlin (2001) also predict a highly lopsided distribution of signaling-related energy consumption between neurons and glia in grey matter – a distribution they suggest is supported by the observed distribution of mitochondria reported in Wong-Riley (1989) (see figure below). If glial cells were doing more information-processing than neurons, they would have to be doing it using much less energy – a situation in which, naively, it would appear metabolically optimal to have more glial cells than neurons. To me, the fact that neurons receive so much more of a precious resource suggests that they are the more central signaling element.376
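To illustrate the arithmetic at stake (my own reconstruction, not Zhu et al. (2012)’s actual method): if each non-neuronal cell uses a fraction F of a neuron’s energy, and there are R glia per neuron, then neurons’ share of cellular energy use is 1/(1 + R × F). A sketch, with the white-matter ratio an assumed illustrative value:

```python
# Illustrative reconstruction of the neuron/glia energy split (not Zhu et
# al.'s actual method). glial_fraction = per-cell glial energy use as a
# fraction of a neuron's energy use.

def neuron_energy_share(glia_per_neuron, glial_fraction=0.03):
    """Neurons' share of cellular energy, given R glia per neuron."""
    return 1 / (1 + glia_per_neuron * glial_fraction)

print(f"{neuron_energy_share(1.0):.0%}")   # ~97% at a ~1:1 ratio
print(f"{neuron_energy_share(15.0):.0%}")  # ~69% at an assumed 15:1 ratio
```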

 

Figure 12: Comparing neuron and glia energy usage in grey matter. From Attwell, David and Laughlin, Simon. “An Energy Budget for Signaling in the Grey Matter of the Brain”, Journal of Cerebral Blood Flow and Metabolism, 21:1133–1145, 2001; FIG. 3B, p. 1140, © 2001 The International Society for Cerebral Blood Flow and Metabolism. Reprinted by Permission of SAGE Publications, Ltd. FIG. 3A in the original text is not shown, original caption in endnote.377

Overall, while some experts are skeptical of the importance of glia to information-processing, the evidence that they play at least some role seems to me fairly strong.378 How central of a role, though, is a further question, and the total number of glial cells, together with their limited energy consumption relative to neurons, does not, to me, initially suggest that capturing this role would require substantially more FLOP/s than capturing standard neuron signaling and learning.

2.3.3 Electrical synapses

In addition to the chemical synapses involved in standard neuron signaling, neurons (and other cells) also form electrical synapses – that is, connections that allow ions and other molecules to flow directly from one cell into another. The channels mediating these connections are known as gap junctions.

These have different properties than chemical synapses. In particular:

  • Electrical synapses are faster, passing signals in a fraction of a millisecond.379
  • Electrical synapses can be bi-directional, allowing each cell to influence the other.380
  • Electrical synapses allow graded transmission of sub-threshold electrical signals.381

My impression is that electrical synapses receive much less attention in neuroscience than chemical synapses. This may be because they are thought to be some combination of:

  • Much less common.382
  • More limited in the behavior they can produce (chemical synapses, for example, can amplify pre-synaptic signals).383
  • Involved in synchronization between neurons, or global oscillations, in ways that do not imply complex information-processing.384
  • Amenable to very simple modeling.385

Still, electrical synapses can play a role in task-performance,386 and one expert suggested that they could create computationally expensive non-linear dynamics.387 What’s more, if they are sufficiently fast, or require sufficiently frequent updates, this could compensate for their low numbers. For example, one expert suggested that you can model gap junctions as synapses that update every timestep.388 But if chemical synapses only receive spikes, and hence update, ~once per second, and we use 1 ms timesteps, you’d need to have ~1000x fewer gap junctions in order for their updates not to dominate.
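Here is that comparison as a sketch (numbers from the text; “updates” counts events to be computed, not FLOPs):

```python
# Comparing update counts: chemical synapses updating per spike (~1/sec)
# vs. gap junctions modeled as updating every 1 ms timestep (values from
# the text; illustrative only).

CHEMICAL_SYNAPSES = 1e14      # lower-end synapse count
SPIKE_RATE_HZ = 1.0           # ~1 spike through synapse per second
TIMESTEPS_PER_SECOND = 1e3    # 1 ms timesteps

chemical_updates_per_sec = CHEMICAL_SYNAPSES * SPIKE_RATE_HZ
# Gap-junction updates stay below chemical-synapse updates only if:
max_gap_junctions = chemical_updates_per_sec / TIMESTEPS_PER_SECOND
print(f"{max_gap_junctions:.0e}")   # 1e+11, i.e., ~1000x fewer than synapses
```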

Overall, my best guess is that incorporating electrical synapses would not substantially increase our FLOP/s budget, but this is centrally based on a sense that experts treat their role in information-processing as relatively minor.

2.3.4 Ephaptic effects

Neuron activity creates local electric fields that can have effects on other neurons. These are known as ephaptic effects. We know that these effects can occur in vitro (see especially Chiang et al. (2019))389 and entrain action potential firing,390 and Chiang et al. (2019) suggest that they may explain slow oscillations of neural activity observed in vivo.391

A recent paper, though, suggests that the question of whether they have any functional relevance in vivo remains quite open,392 and one expert thought them unlikely to be important to task-performance.393

One reason for doubt is that the effects on neuron membrane potential appear to be fairly small (e.g., <0.5 mV, compared with the ~15 mV gap between resting membrane potential and the threshold for firing),394 and may be drowned out in vivo by noise that is artificially absent in vitro.395

Even if they were task-relevant, though, they would be spatially imprecise – arising from, and exerting effects on, the activity of groups of neurons, rather than on individual cells. Two experts took this as reason to think their role in task-performance would not be computationally expensive to capture.396 That said, actually modeling electric fields seems plausibly quite FLOP/s-intensive.397

2.3.5 Other forms of axon signaling

Action potentials are traditionally thought of as binary choices – a neuron fires, or it doesn’t – induced by changes to somatic membrane potential, and synaptic transmission as a product of this binary choice.398 But in some contexts, this is too simple. For example:

  • The waveform of an action potential (that is, its amplitude and duration) can vary in a way that affects neurotransmitter release.399
  • Variations in the membrane potential that occur below the threshold of firing (“subthreshold” variations) can also influence synaptic transmission.400
  • Certain neurons – for example, neurons in early sensory systems,401 and neurons in invertebrates402 – also release neurotransmitter continually, in amounts that depend on non-spike changes to pre-synaptic membrane potential.403
  • Some in vitro evidence suggests that action potentials can arise in axons without input from the soma or dendrites.404

Do these imply substantial increases to FLOP/s budgets? Most of the studies I looked at seemed to be more in the vein of “here is an effect that can be created in vitro” than “here is a widespread effect relevant to in vivo task-performance,” but I only looked into this very briefly, the possible mechanisms/complexities are diverse, and evidence of the latter type is rare regardless.

Some effects (though not all)405 also required sustained stimulation (e.g., “hundreds of spikes over several minutes,”406 or “100 ms to several seconds of somatic depolarization”407); and the range of neurons that can support axon signaling via sub-threshold membrane potential fluctuations also appears somewhat unclear, as the impact of such fluctuations is limited by the voltage decay along the axon.408

Overall, though, I don’t feel very informed or clear about this one. As with electrical synapses, I think the central consideration for me is that the field doesn’t seem to treat it as central.

2.3.6 Blood flow

Blood flow in the brain correlates with neural activity (this is why fMRI works). This is often explained via the blood’s role in maintaining brain function (e.g., supplying energy, removing waste, regulating temperature).409 Moore and Cao (2008), though, suggest that blood flow could play an information-processing role as well – for example, by delivering diffusible messengers like nitric oxide, altering the shape of neuron membranes, modulating synaptic transmission by changing brain temperature, and interacting with neurons indirectly via astrocytes.410 The timescales of activity-dependent changes in blood flow are on the order of hundreds of milliseconds (the effects of such changes often persist after a stimulus has ended, but Moore and Cao believe this is consistent with their hypothesis).411

My impression, though, is that most experts don’t think that blood flow plays a very direct or central role in information-processing.412 And the spatial resolution appears fairly coarse regardless: Moore and Cao (2008) suggest resolution at the level of a cortical column (a group of neurons413), or an olfactory glomerulus (a cluster of connections between cells).414

2.3.7 Overall FLOP/s for other signaling mechanisms

Here is a chart summarizing some of the considerations just canvassed (see the actual sections for citations).

| Mechanism | Description | Speed | Spatial precision | Number/frequency | Evidence for task-relevance |
| --- | --- | --- | --- | --- | --- |
| Other chemical signals | Chemical signals other than classical neurotransmitters. Includes neuropeptides, gases like nitric oxide, endocannabinoids, and others. | Limited by the speed of chemical diffusion, and by the timescales of metabotropic receptors. | Imprecise. Affect groups of cells by diffusing through the extracellular space and/or through cell membranes, rather than via synapses. | Very common. However, some signal broadcasts are fairly rare, and may take ~400 spikes to trigger. | Strong. Can alter circuit dynamics and neuron input-output functions; role in synaptic plasticity. |
| Glia | Non-neuron cells traditionally thought to play a supporting role in the brain, but some of which may be more directly involved in task-performance. | Some local calcium responses within ~100 ms; other calcium signaling on timescales of seconds or longer. | Can respond locally to individual synaptic events. | ~1:1 ratio with neurons (not 10:1). Astrocytes (the most clearly task-relevant type of glial cell) are only 20-40% of glia. | Moderate. Role in zebrafish behavior. Plausible role in plasticity, synaptic transmission, and elsewhere. However, glia have a much smaller energy budget than neurons. |
| Electrical synapses | Connections between cells that allow ions and other molecules to flow directly from one to the other. | Very fast. Can pass signals in a fraction of a millisecond. | Precise. Signals are passed between two specific cells. But may function to synchronize groups of neurons. | Thought to be less common than chemical synapses (but may be passing signals more continuously, and/or require more frequent updates?). | Can play a role, but thought to be less important than chemical synapses? More limited range of signaling behaviors. |
| Ephaptic effects | Local electrical fields that can impact neighboring neurons. | ? Some oscillations that ephaptic effects could explain are slow-moving. Unsure about speed of lower-level effects. | Imprecise. Arise from the activity of many cells; effects are not targeted at specific cells. | ? | Weak? Small effects on membrane potential possibly swamped by noise in vivo. |
| Other forms of axon signaling | Processes in a neuron, other than a binary firing decision, that impact synaptic transmission. | ? Some effects required sustained stimulation (minutes of spiking, 100 ms to seconds of depolarization). Others arose more quickly (15-50 ms of hyperpolarization). | Precise; proceeds via axons/individual synapses. | Unclear what range of neurons can support some of the effects (e.g., sub-threshold influences on synaptic transmission). | Some effects relevant in at least some species/contexts. Other evidence mostly from in vitro studies? |
| Blood flow | Some hypothesize that blood flow in the brain is involved in information-processing. | Responses within hundreds of ms, which persist after stimulus has ended. | Imprecise. At the level of a cortical column, or a cluster of connections between cells. | ? | Weak. Widely thought to be epiphenomenal. |

Figure 13: Factors relevant to FLOP/s budgets for other signaling mechanisms in the brain.
 

Obviously, my investigations were cursory, and there is a lot of room for uncertainty in each case. What’s more, the list is far from exhaustive,415 and other mechanisms may await discovery.416

Still, as mentioned earlier, my best guess is that capturing the role of other signaling mechanisms (known and unknown) in task-performance does not require substantially more FLOP/s than capturing standard neuron signaling and learning. This guess is primarily grounded in a sense that computational neuroscientists generally treat standard neuron signaling (and the plasticity thereof) as the primary vehicle of information-processing in the brain, and other mechanisms as secondary.417 An initial look at the speed, spatial precision, prevalence, and task-relevance of the most salient of these mechanisms seems compatible with such a stance, so I’m inclined to defer to it, despite the possibility that it emerges primarily from outdated assumptions and/or experimental limitations, rather than good evidence.

2.4 Overall mechanistic method FLOP/s

Here are the main numbers we’ve discussed thus far:

Standard neuron signaling: ~1e13-1e17 FLOP/s

  • Synaptic transmission: 1e13-1e17 FLOP/s
    • Spikes through synapses per second: 1e13-1e15
    • FLOPs per spike through synapse:
      • Low: 1 (one addition and/or multiply operation, reflecting impact on post-synaptic membrane potential)
      • High: 100 (covers 40 FLOPs for synaptic conductances, plus cushion for other complexities)
  • Firing decisions: 1e13-1e17 FLOP/s
    • Number of neurons: 1e11
    • FLOP/s per neuron:
      • Low: 100 (ReLU, 10 ms timesteps)
      • Middle: 10,000 (Izhikevich model, 1 ms timesteps)
      • High: 1,000,000 (single-compartment Hodgkin-Huxley model, 0.1 ms timesteps)

Learning: <1e13-1e17 FLOP/s

  • Spikes through synapses per second: 1e13-1e15
  • FLOPs per spike through synapse:
    • Low: <1 (possibly due to slow timescales)
    • Middle: 1-10 (covers various learning models – Hebbian plasticity, first-order gradient methods, possibly Benna and Fusi (2016) – and expert estimates, relative to low-end baselines)
    • High: 100 (covers those models with more cushion/relative to higher baselines)

Other signaling mechanisms: do not meaningfully increase the estimates above.

Overall range: ~1e13-1e17 FLOP/s418
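Here is the same summary as explicit arithmetic (values from the list above; the overall range reflects aggressive rounding, not a strict sum):

```python
# Recomputing the mechanistic-method components (values from the summary above).

spikes_per_sec = (1e13, 1e15)      # spikes through synapses per second
flops_per_spike = (1, 100)         # low/high FLOPs per spike through synapse
neurons = 1e11
flops_per_neuron = (100, 1e6)      # ReLU (10 ms) ... Hodgkin-Huxley (0.1 ms)

synaptic = (spikes_per_sec[0] * flops_per_spike[0],
            spikes_per_sec[1] * flops_per_spike[1])    # 1e13-1e17 FLOP/s
firing = (neurons * flops_per_neuron[0],
          neurons * flops_per_neuron[1])               # 1e13-1e17 FLOP/s
learning = synaptic                                    # <1e13-1e17 FLOP/s

for name, (lo, hi) in [("synaptic transmission", synaptic),
                       ("firing decisions", firing),
                       ("learning", learning)]:
    print(f"{name}: {lo:.0e}-{hi:.0e} FLOP/s")
```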

To be clear: the choices of “low” and “high” here are neither principled nor fully independent, and I’ve rounded aggressively.419 Indeed, another, possibly more accurate way to summarize the estimate might be:

“There are roughly 1e14-1e15 synapses in the brain, receiving spikes about 0.1-1 times a second. A simple estimate budgets 1 FLOP per spike through synapse, and two extra orders of magnitude would cover some complexities related to synaptic transmission, as well as some models of learning. This suggests something like 1e13-1e17 FLOP/s. You’d also need to cover firing decisions, but various simple neuron models, scaled up by 1e11 neurons, fall into this range as well, and the high end (1e17 FLOP/s) would cover a level of modeling detail that I expect many computational neuroscientists to think unnecessary (single compartment Hodgkin-Huxley). Accounting for the role of other signaling mechanisms probably doesn’t make much of a difference to these numbers.”

That is, this is meant to be a plausible ballpark, covering various types of models that seem plausibly adequate to me.

2.4.1 Too low?

Here are some ways it could be too low:

  • The choice to budget FLOP/s for synaptic transmission and learning based on spikes through synapses, rather than timesteps at synapses, is doing a lot of work. If we instead budgeted based on timesteps, and used something like 1 ms resolution, we’d start with 1e17-1e18 FLOP/s as a baseline (1 FLOP per timestep per synapse). Finer temporal resolutions, and larger numbers of FLOPs per time-step, would drive these numbers higher.
  • Some neural processes are extremely temporally precise. For example, neurons in the owl auditory system can detect auditory stimulus timing at a precision of less than ten microseconds.420 These cases may be sufficiently rare, or require combining a sufficient number of less-precise inputs, that they wouldn’t make much of a difference to the overall budget. However, if they are indicative of a need for much finer temporal precision across the board, they could imply large increases.
  • Dendritic computation might imply much larger FLOP/s budgets than single-compartment Hodgkin-Huxley models.421 Results like Beniaguev et al. (2020) (~1e10 FLOP/s per neuron), discussed above, seem like some initial evidence for this.
  • Some CNN/RNN models used to predict the activity of retinal neurons are very FLOP/s intensive as well. I discuss this in Section 3.1.
  • Complex molecular machinery at synapses or inside neurons might implement learning algorithms that would require more than 100 FLOPs per spike through synapse to replicate.422 And I am intrigued by theoretical results showing that various models of synaptic plasticity lead to problems like catastrophic forgetting, and that introducing larger numbers of dynamical variables at synapses might help with online learning.423
  • One or more of the other signaling mechanisms in the brain might introduce substantially additional FLOP/s burdens (neuromodulation and glia seem like prominent candidates, though I feel most uncertainty about the specific arguments re: gap junctions and alternative forms of axon signaling).
  • Processes in the brain that take place over longer timescales involve interactions between many biophysical variables in the brain that are not normally included in e.g. simple models of spiking. The length of these timescales might limit the compute burdens such interactions imply, but if not, updating all relevant variables at a frequency similar to the most frequently updated variables could imply much larger compute burdens.424
  • Some of the basic parameters I’ve used could be too low. The average spike rate might be more like 10 Hz than 0.1-1 Hz (I really doubt 100 Hz); synapse count might be >1e15; Hodgkin-Huxley models might require more FLOP/s than Izhikevich (2004) budgets, etc. Indeed, I’ve been surprised at how uncertain many very basic facts about the brain appear to be, and how wrong previous widely-cited numbers have been (for example, a 10:1 ratio between glia and neurons was widely accepted until it was corrected to roughly 1:1).425

There are also broader considerations that could incline us towards higher numbers by default, and/or skepticism of arguments in favor of the adequacy of simple models:

  • We might expect evolution to take advantage of every possible mechanism and opportunity available for increasing the speed, efficiency, and sophistication of its information-processing.426 Some forms of computation in biological systems, for example, appear to be extremely energy efficient.427 Indeed, I think that further examination of the sophistication of biological computation in other contexts could well shift my default expectations about the brain’s sophistication substantially (though I have tried to incorporate hazy forecasts in this respect into my current overall view).428
  • It seems possible that the task-relevant causal-structure of the brain’s biology is just intrinsically ill-suited to replication using digital computer hardware, even once you allow for whatever computational simplifications are available (though neuromorphic hardware might do better). For example, the brain may draw on analog physical primitives,429 continuous (or very fine-grained) temporal dynamics,430 and/or complex biochemical interactions that are cheap for the brain, but very expensive to simulate.431
  • Limitations on tools and available data plausibly do much to explain the concepts and assumptions most prominent in neuroscience. As these limitations loosen, we may identify much more complex forms of information-processing than the field currently focuses on.432 Indeed, it might be possible to extrapolate from trends in this vein, either in neuroscience or across biology more broadly.433
  • Various experts mentioned track-records of over-optimism about the ease of progress in biology, including via computational modeling;434 overly-aggressive claims in favor of particular neuroscientific research programs;435 and over-eagerness to think of the brain in terms of the currently-most-trendy computational/technological paradigms.436 To the extent such track records exist, they could inform skepticism about arguments and expert opinions in a similar reference class (though on their own, they seem like only very indirect support for very large FLOP/s requirements, as many other explanations of such track records are available).

And of course, more basic paradigm mistakes are possible as well.437

This is a long list of routes to higher numbers; perhaps, then, we might expect at least one of them to track the truth. However:

  • Some particular routes are correlated: for example, worlds in which the brain can implement very sophisticated, un-simplifiable computation at synapses seem more likely to be ones in which it can implement such computation within dendrites as well.438
  • My vague impression is that experts tend to be inclined towards simplification vs. complexity across the board, rather than in specific patterns that differ widely. If this is true, then the reliability of the assumptions and methods these experts employ might be a source of broader correlations.
  • Some of these routes are counterbalanced by corresponding routes to lower numbers (e.g., basic parameters could be too high as well as too low; relevant timescales could be more coarse-grained rather than more fine-grained; etc). And there are more general routes to lower numbers as well, which would apply even if some of the considerations surveyed above are sound (see next section).

2.4.2 Too high?

Here are a number of ways 1e13-1e17 FLOP/s might be overkill (I’ll focus, here, on ways that are actively suggested by examination of the brain’s mechanisms, rather than on the generic consideration that for any given way of performing a task, there may be a more efficient way).

2.4.2.1 Neuron populations and manifolds

The framework above focuses on individual neurons and synapses. But this could be too fine-grained. For example, various popular models in neuroscience involve averaging over groups of neurons, and/or treating them as redundant representations of high-level variables.439

Indeed, in vivo recording shows that the dimensionality of the activity of a network of neurons is much smaller than the number of neurons themselves (Wärnberg and Kumar (2017) suggest a subspace spanned by ~10 variables, for local networks consisting of thousands of neurons).440 This kind of low-dimensional subspace is known as a “neural manifold.”441

Some of this redundancy may be about noise: neurons are unreliable elements, so representing high-level variables using groups of them may be more robust.442 Digital computers, though, are noise-free.

In general, the possibility of averaging over or summarizing groups of neurons suggests smaller budgets than the estimates above – possibly much smaller. If I had more time for this project, this would be on the top of my list for further investigation.

2.4.2.2 Transistors and emulation costs

If we imagine applying the mechanistic method to a digital computer we don’t understand, we plausibly end up estimating the FLOP/s required to model the activity of very low-level components: e.g., transistors, logic gates, etc. (or worse, to simulate low-level physical processes within transistors). This is much more than the FLOP/s the computer can actually perform.

For example: a V100 has about 2e10 transistors, and a clock speed of ~1e9 Hz.443 A naive mechanistic method estimate for a V100 then, might budget 1 FLOP per clock-tick per transistor: 2e19 FLOP/s. But the chip’s actual computational capacity is ~1e14 FLOP/s – a factor of 2e5 less.
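Here is that arithmetic as a sketch (transistor count, clock speed, and capacity are the approximate figures above):

```python
# Naive mechanistic-method estimate for a V100 GPU (values from the text).

transistors = 2e10
clock_hz = 1e9
actual_flops = 1e14                           # approximate V100 capacity

naive_estimate = transistors * clock_hz       # 1 FLOP per transistor per tick
print(f"naive: {naive_estimate:.0e} FLOP/s")  # 2e+19
print(f"overkill factor: {naive_estimate / actual_flops:.0e}")  # 2e+05
```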

The costs of emulating different computer systems at different levels of detail may also be instructive here. For example, one attempt to simulate a 6502 microprocessor (original clock speed of ~1 MHz) at the transistor level managed to run the simulated chip at 1 kHz using a computer running at ~1 GHz, suggesting a factor of ~1e6 slow-down.444

Of course, there is no easy mapping between computer components and brain components; and there are components in the brain at lower levels than neurons (e.g., ion channels, proteins, etc.). Still, applying the mechanistic method to digital computers suggests that when we don’t know how the system works, there is no guarantee that we land on the right level of abstraction, and hence that estimates based on counting synapses, spikes, etc. could easily be overkill relative to the FLOP/s requirements of the tasks the brain can actually perform (I discuss this issue more in the appendix).

How much overkill is harder to say, at least using the mechanistic method alone: absent knowledge of how a V100 processes information, it’s not clear to me how to modify the mechanistic method to arrive at 1e14 FLOP/s rather than 2e19. Other methods might do better.

Note, though, that applying the mechanistic method without a clear understanding of whether models at the relevant level of abstraction could replicate task-performance at all could easily be “underkill” as well.

2.4.2.3 Do we need the whole brain?

Do we need the whole brain? For some tasks, no. People with parts of their brains missing/removed can still do various things.

A dramatic example is the cerebellum, which contains ~69 billion neurons – ~80% of the neurons in the brain as a whole.445 Some people (a very small number) don’t have cerebellums. Yet there are reports that in some cases, their intelligence is affected only mildly, if at all (though motor control can also be damaged, and some cognitive impairment can be severe).446

Does this mean we can reduce our FLOP/s budget by 80%? I’m skeptical. For one thing, while the cerebellum accounts for a large percentage of the brain’s neurons, it appears to account for a much smaller percentage of other things, including volume (~10%),447 mass (~10%),448 energy consumption (<10%),449 and maybe synapses (and synaptic activity dominates many versions of the estimates above).450

More importantly, though, we’re looking for FLOP/s estimates that apply to the full range of tasks that the brain can perform, and it seems very plausible to me that some of these tasks (neurosurgery? calligraphy?) will rely crucially on the cerebellum. Indeed, the various impairments generally suffered by patients without cerebellums seem suggestive of this.

This last consideration applies across the board, including to other cases in which various types of cognitive function persist in the face of missing parts of the brain,451 neuron/synapse loss,452 etc. That is, while I expect it to be true of many tasks (perhaps even tasks important to AI developers, like natural language processing, scientific reasoning, social modeling, etc.) that you don’t need the whole brain to do them, I also expect us to be able to construct tasks that do require most of the brain. It also seems very surprising, from an evolutionary perspective, if large, resource-intensive chunks of the brain are strictly unnecessary. And the reductions at stake seem unlikely to make an order-of-magnitude difference anyway.

2.4.2.4 Constraints faced by evolution

In designing the brain, evolution faced many constraints less applicable to human designers.453 For example, constraints on:

  • The brain’s volume.
  • The brain’s energy consumption.
  • The growth and maintenance it has to perform.454
  • The size of the genome it has to be encoded in.455
  • The comparatively slow and unreliable elements it has to work with.456
  • Ability to redesign the system from scratch.457

It may be that these constraints explain the brain’s functional organization at sufficiently high levels that if we understood the overarching principles at work, we would see that much of what the brain does (even internally) is comparatively easy to do with human computers, which can be faster, bigger, more reliable, more energy-intensive, re-designed from scratch, and built using external machines on the basis of designs stored using much larger amounts of memory.458 This, too, suggests smaller budgets.

2.4.3 Beyond the mechanistic method

Overall, I find the considerations pointing to the adequacy of smaller budgets more compelling than the considerations pointing to the necessity of larger ones (though it also seems, in general, easier to show that X is enough, than that X is strictly required – an asymmetry present throughout the report). But the uncertainties in either direction rightly prompt dissatisfaction with the mechanistic method’s robustness. Is there a better approach?

 

3 The functional method

Let’s turn to the functional method, which attempts to identify a portion of the brain whose function we can already approximate with artificial systems, together with the computational costs of doing so, and then to scale up to an estimate for the brain as a whole.

Various attempts at this method have been made. To limit the scope of the section, I’m going to focus on two categories: estimates based on the retina, and estimates based on the visual cortex. But I expect many problems to generalize.

As a preview of my conclusion: I give less weight to these estimates than to the mechanistic method, primarily due to uncertainties about (a) what the relevant portion of the brain is doing (in the case of the visual cortex), (b) differences between that portion and the rest of the brain (in the case of the retina), and (c) the FLOP/s required to fully replicate the functions in question. However, I take visual cortex estimates as some weak evidence that the mechanistic method range above (1e13-1e17 FLOP/s) isn’t much too low. Some estimates based on recent deep neural network models of retinal neurons point to higher numbers. I take these on their own as even weaker evidence, but I think they’re worth understanding better.

3.1 The retina

As I discussed in Section 2.1.2.1.2, the retina is one of the best-understood neural circuits.459 Could it serve as a basis for a functional method estimate?

3.1.1 Retina FLOP/s

We don’t yet have very good artificial retinas (though development efforts are ongoing).460 However, this has a lot to do with engineering challenges – e.g., building devices that interface with the optic nerve in the right way.461 Even absent fully functional artificial retinas, we may be able to estimate the FLOP/s required to replicate retinal computation.

Moravec (1988, 1998, and 2008) offers some estimates in this vein.462 He treats the retina as performing two types of operations – a “center surround” operation, akin to detecting an edge, and a “motion detection” operation – and reports that in his experience with robot vision, such operations take around 100 calculations to perform.463 He then divides the visual field into patches, processing of which gets sent to a corresponding fiber of the optic nerve, and budgets ten edge/motion detection operations per patch per second (ten frames per second is roughly the frequency at which individual images become indistinguishable for humans).464 This yields an overall estimate of:

1e6 ganglion cells × 100 calculations per edge/motion detection × 10 edge/motion detections per second = 1e9 calculations/sec for the whole retina
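Or, as a sketch (all values from Moravec’s estimate as just described):

```python
# Moravec's retina estimate, spelled out (values from the text).

ganglion_cells = 1e6
calcs_per_detection = 100     # per edge/motion detection operation
detections_per_sec = 10       # ~10 "frames" per second

retina_calcs_per_sec = ganglion_cells * calcs_per_detection * detections_per_sec
print(f"{retina_calcs_per_sec:.0e} calculations/sec")   # 1e+09
```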

Is this right? At the least, it’s incomplete: neuroscientists have catalogued a wide variety of computations that occur in the retina, other than edge and motion detection (I’m not sure how many were known at the time). For example: the retina can anticipate motion,465 it can signal that a predicted stimulus is absent,466 it can adapt to different lighting conditions,467 and it can suppress vision during saccades.468 And further computations may await discovery.469

But since Moravec’s estimates, we’ve also made progress in modeling retinal computation. Can recent models provide better estimates?

Some of these models were included in Figure 7. Of these, it seems best to focus on models trained on naturalistic stimuli, retinal responses to which have proven more difficult to capture than responses to more artificial stimuli.470 RNN/CNN neural network models appear to have more success at this than some other variants,471 so I’ll focus on two of these:

  1. Maheswaranathan et al. (2019), who train a three-layer CNN to predict the outputs of ganglion cells in response to naturalistic stimuli, and achieve a correlation coefficient greater than 0.7 (retinal reliability is 0.8).
  2. Batty et al. (2017), who use a shared, two-layer RNN on a similar task, and capture ~80% of explainable variance across experiments and cell types.

These models are not full replications of human retinal computation. Gaps include:

  • Their accuracy can still be improved, and what’s missing might matter.472
  • The models have only been trained on a very narrow class of stimuli.473
  • Inputs are small (50 × 50 pixels or less) and black-and-white (though I think they only need to be as large as the relevant ganglion cell’s receptive field).
  • The models don’t include adaptation (though one expert did not expect adaptation to make much of a difference to overall computational costs).474
  • We probably need to capture correlations across cells, in addition to individual cell responses.475
  • Maheswaranathan et al. (2019) use salamander retinal ganglion cells, results from which may not generalize well to humans (Batty et al. (2017) use primate cells, which seem better).476
  • There are a number of other possible gaps (see endnote).477

What sort of FLOP/s budgets would the above models imply, if they were adequate?

  • The CNN in Maheswaranathan et al. (2019) requires about 2e10 FLOPs to predict the output of one ganglion cell over one second.478 However, adding more ganglion cells only increases the costs in the last layer of the network. A typical experiment involves 5-15 cells, suggesting ~2e9 FLOP/s per cell, and one of the co-authors on the paper (Prof. Baccus) could easily imagine scaling up to 676 cells (the size of the last layer), which would cost ~20.4 billion FLOP/s (3e7 per cell); or 2500 cells (the size of the input), which would cost 22.4 billion FLOP/s (~1e7 per cell).479 I’ll use this last number, which suggests ~1e7 FLOP/s per retinal ganglion cell. However, I don’t feel that I have a clear grip on how to pick an appropriate number of cells.
  • I estimate that the RNN in Batty et al. (2017) requires around 1e5 FLOP for one 0.83 ms bin.480 I’m less clear on how this scales per ganglion cell, so I’ll assume one cell for the whole network: e.g., ~1e8 FLOP/s per retinal ganglion cell.

These are much higher than Moravec’s estimate of 1000 calculations/s per ganglion cell, and they result in much higher estimates for the whole retina: 1e13 FLOP/s and 1e14 FLOP/s, respectively (assuming 1e6 ganglion cells).481 But it’s also a somewhat different task: that is, predicting retinal spike trains, as opposed to motion/edge detection more broadly.
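Here is the per-cell and whole-retina arithmetic as a sketch (values from the text; the linear scale-up by ganglion cell count is a simplification, for reasons noted just below):

```python
# Per-cell and whole-retina FLOP/s implied by the two DNN retina models
# (values from the text; linear scale-up is a simplification).

GANGLION_CELLS = 1e6

# Maheswaranathan et al. (2019): ~2.24e10 FLOP/s for a 2500-cell output.
cnn_per_cell = 2.24e10 / 2500              # ~1e7 FLOP/s per cell

# Batty et al. (2017): ~1e5 FLOPs per 0.83 ms bin, treated as one cell.
rnn_per_cell = 1e5 / 0.83e-3               # ~1e8 FLOP/s per cell

print(f"CNN, whole retina: ~{cnn_per_cell * GANGLION_CELLS:.0e} FLOP/s")
# prints ~9e+12, i.e., ~1e13
print(f"RNN, whole retina: ~{rnn_per_cell * GANGLION_CELLS:.0e} FLOP/s")
# prints ~1e+14
```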

Note, also, that in both cases, the FLOPs costs are dominated by the first layer of the network, which processes the input, so costs would scale with the size of the input (though the input size relevant to an individual ganglion cell will presumably be limited by the spatial extent of its receptive field).482 And in general, the scale-up to the whole retina here is very uncertain, as I feel very uninformed about what it would actually look like to run versions of these models on such a scale (how much of the network could be reused for different cells, what size of receptive field each cell would need, etc).

3.1.2 From retina to brain

What does it look like to scale up from these estimates to the brain as a whole? Here a few ways of doing so, and the results:

 

| Basis for scaling | Rough scaling factor | Applied to Moravec estimate (1e9 calcs/s) | Applied to Maheswaranathan et al. (2019) estimate (1e13 FLOP/s) | Applied to Batty et al. (2017) estimate (1e14 FLOP/s) |
| --- | --- | --- | --- | --- |
| Mass | 4e3-1e5483 | 4e12-1e14 | 4e16-1e18 | 4e17-1e19 |
| Volume | 4e3-1e5484 | 4e12-1e14 | 4e16-1e18 | 4e17-1e19 |
| Neurons | 1e3-1e4485 | 1e12-1e13 | 1e16-1e17 | 1e17-1e18 |
| Synapses | 1e5-1e6486 | 1e14-1e15 | 1e18-1e19 | 1e19-1e20 |
| Energy use | 4e3487 | 4e12 | 4e16 | 4e17 |
| Overall range | 1e3-1e6 | 1e12-1e15 | 1e16-1e19 | 1e17-1e20 |
Figure 14. Estimates of the FLOP/s to replicate retinal computation, scaled up to the whole brain based on various factors.
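For reference, the table’s arithmetic as a sketch (scaling factors and base estimates as above):

```python
# Reproducing Figure 14's scale-up arithmetic (values from the text).

base_estimates = {
    "Moravec (calcs/s)": 1e9,
    "Maheswaranathan et al. (2019)": 1e13,
    "Batty et al. (2017)": 1e14,
}
scaling_factors = {
    "mass": (4e3, 1e5),
    "volume": (4e3, 1e5),
    "neurons": (1e3, 1e4),
    "synapses": (1e5, 1e6),
    "energy use": (4e3, 4e3),
}

for basis, (lo, hi) in scaling_factors.items():
    cells = ", ".join(f"{est * lo:.0e}-{est * hi:.0e}"
                      for est in base_estimates.values())
    print(f"{basis:>10}: {cells}")
```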

The full range here runs from 1e12 calc/s (low-end Moravec) to 1e20 FLOP/s (high-end Batty et al. (2017)). Moravec argues for scaling based on a combination of mass and volume, rather than neuron count, on the grounds that the retina’s neurons are unusually small and closely packed, and that the brain can shrink neurons while keeping overall costs in energy and materials constant.488 Anders Sandberg objects to volume, due to differences in “tissue structure and constraints.”489 He prefers neuron count.490

Regardless of how we scale, though, the retina remains different from the rest of the brain in many ways. Here are a few:

  • The retina is probably less plastic.491
  • The retina is highly specialized for performing one particular set of tasks.492
  • The retina is subject to unique physical constraints.493
  • Retinal circuitry has lower connectivity, and exhibits less recurrence.494
  • We are further from having catalogued all the cell types in the brain than in the retina.495
  • Some of the possible complications discussed in the mechanistic method section (for example, some forms of dendritic computation, and some alternative signaling mechanisms like ephaptic effects) may not be present in the retina in the same way.496

Not all of these, though, seem to clearly imply higher FLOP/s burdens per unit something (cell, synapse, volume, etc.) in the brain than in the retina (they just suggest possible differences). Indeed, Moravec argues that given the importance of vision, the retina may be “evolutionarily more perfected, i.e. computationally dense, than the average neural structure.”497 And various retina experts were fairly sympathetic to scaling up from the retina to the whole brain.498

Where does this leave us? Overall, I think that the estimates based on the RNN/CNN models discussed above (1e16-1e20 FLOP/s) are some weak evidence for FLOP/s requirements higher than the mechanistic method range discussed above (1e13-1e17 FLOP/s). And these could yet be under-estimates, either because more FLOP/s are required to replicate retinal ganglion cell outputs with adequate accuracy across all stimuli; or because neural computation in the brain is more complicated, per relevant unit (volume, neuron, watt, etc.) than in the retina (the low plasticity of the retina seems to me like an especially salient difference).

Why only weak evidence? Partly because I’m very uncertain about what it would actually look like to scale these models up to the retina as a whole. And as I discussed in Section 2.1.2.2, I’m wary of updating too much based on a few studies I haven’t investigated in depth. What’s more, it seems plausible to me that these models, while better than current simpler models at fitting retinal spike trains, use more FLOP/s (possibly much more) than are required to do what the retina does. Reasons include:

  • The FLOP/s budgets for these RNN/CNN retina models depend on specific implementation choices (for example, input size and architecture) that don’t seem to reflect model complexity that has yet been found necessary. Bigger models will generally allow better predictions, but our efforts to predict retinal spikes using deep neural networks seem to be in early stages, and it doesn’t seem like we yet have enough data to ground strong claims about the network size required for a given level of accuracy (and we don’t know what level of accuracy is necessary, either).
  • I’m struck by how much smaller Moravec’s estimate is. It’s true that this estimate is incomplete in its coverage of retinal computation – but it surprises me somewhat if (a) his estimates for edge and motion detection are correct (Prof. Barak Pearlmutter expected Moravec’s robotic vision estimates to be accurate),499 but (b) the other functions he leaves out result in an increase of 4-5 orders of magnitude. Part of the difference here might come from his focus on high-level tasks, rather than replicating spike trains.
  • The CNN in Maheswaranathan et al. (2019) would require ~2e10 FLOP/s to predict the outputs of 2500 cells in response to a 50 × 50 input. But various vision models discussed in the next section take in larger inputs (224 × 224 × 3),500 and run on comparable FLOP/s (~1e10 FLOP/s for an EfficientNet-B2 run at 10 Hz). It seems plausible to me these vision models cover some non-trivial fraction of what the retina does (e.g., edge detection), along with much that it doesn’t do.

That said, these CNN/RNN results, together with the Beniaguev et al. (2020) results discussed in Section 2.1.2.2, suggest a possible larger pattern: recent DNN models used to predict the outputs of neurons and detailed neuron models appear to be quite FLOP/s intensive. It’s possible these DNNs are overkill. But they could also indicate complexity that simpler models don’t capture. Further experiments in this vein (especially ones emphasizing model efficiency) would shed helpful light.

3.2 Visual cortex

Let’s turn to a different application of the functional method, which treats deep neural networks (DNNs) trained on vision tasks as automating some portion of the visual cortex.501

Such networks can classify full-color images into 1000 different categories502 with something like human-level accuracy.503 They can also localize/assign pixels to multiple identified objects, identify points of interest in an image, and generate captions, but I’ll focus here on image classification (I’m less confident about the comparisons with humans in the other cases).504

What’s more, the representations learned by deep neural networks trained on vision tasks turn out to be state-of-the-art predictors of neural activity in the visual cortex (though the state of the art is not obviously impressive in an absolute sense505).506 Example results include:

  • Cadena et al. (2019): a model based on representations learned by a DNN trained on image classification can explain 51.6% of explainable variance of spiking activity in monkey primary visual cortex (V1, an area involved in early visual processing) in response to natural images. A three-layer DNN trained to predict neural data explains 49.8%. The authors report that these models both outperform the previous state of the art.507
  • Yamins et al. (2014) show that layers of a DNN trained on object categorization can be used to achieve what was then state of the art prediction of spiking activity in the monkey Inferior Temporal cortex (IT, an area thought to be involved in a late stage of hierarchical visual processing) – ~50% of explainable variance explained (though I think the best models can now do better).508 Similar models can also be used to predict spiking activity in area V4 (another area involved in later-stage visual processing),509 as well as fMRI activity in IT.510 The accuracy of the predictions appears to correlate with the network’s performance on image classification (though the correlation weakens for some of the models best at the task).511

We can also look more directly at the features that units in an image classifier detect. Here, too, we see interesting neuroscientific parallels. For example:

  • Neurons in V1 are sensitive to various low-level features of visual input, such as lines and edges oriented in different ways. Some units in the early layers of image classifiers appear to detect similar features. For example, Gabor filters, often used to model V1, are found in such early layers.512
  • V4 has traditionally been thought to detect features like colors and curves.513 These, too, are detected by units in image classifiers.514 What’s more, such networks can be used to create images that can predictably drive firing rates of V4 neurons beyond naturally occurring levels.515

Exactly what to take away from these results isn’t clear to me. One hypothesis, offered by Yamins and DiCarlo (2016), is that hierarchically organized neural networks (a class that includes both the human visual system, and these DNNs) converge on a relatively small set of efficiently-learnable solutions to object categorization tasks.516 But other, more trivial explanations may be available as well,517 and superficial comparisons between human and machine perception can be misleading.518

Still, it seems plausible that at the very least, there are interesting similarities between information-processing occurring in (a) the visual cortex and (b) DNNs trained on vision tasks. Can we turn this into a functional method estimate?

Here are a few of the uncertainties involved.

 

3.2.1 What’s happening in the visual cortex?

One central problem is that there’s clearly a lot happening in the visual cortex other than image classification of the kind these models perform.

In general, functional method estimates fit best with a traditional view in systems neuroscience, according to which chunks of the brain are highly specialized for particular tasks. But a number of experts I spoke to thought this view inaccurate.519 In reality, cortical regions are highly interconnected, and different types of signals show up all over the place. Motor behavior in mice, for example, predicts activity in V1 (indeed, such behaviors are represented using the same neurons that represent visual stimuli);520 and V1 responses to identical visual stimuli change based on a mouse’s estimate of its position in a virtual-reality maze.521 Indeed, Cadena et al. (2019) recorded from 307 monkey V1 neurons, and found that only in about half of them could more than 15% of the variance in their spiking be explained by the visual stimulus (the average, in those neurons, was ~28%).522

Various forms of prediction are also reflected in the visual system, even in very early layers. For example, V1 can fill in missing representations in a gappy motion stimulus.523 Simple image classifiers don’t do this. Neurons in the visual cortex also learn over time, whereas the weights in a typical image classifier are static.524 And there are various other differences besides.525

More generally, as elsewhere in the brain, there’s a lot we don’t know about what the visual cortex is doing.526 And “vision” as a whole, while hard to define clearly, intuitively involves much more than classifying images into categories (for example, visual representations seem closely tied to behavioral affordances, 3D models of a spatial environment, predictions, high-level meanings and associations, etc.).527

 

3.2.2 What’s human level?

Even if we could estimate what percentage of the visual cortex is devoted to image recognition of the type these models perform, it’s also unclear how much such models match human-level performance on that task. For example:

  • DNNs are notoriously vulnerable to adversarial examples,528 some of which are naturally occurring.529 The extent to which humans are analogously vulnerable remains an open question.530
  • DNN image classifiers can generalize poorly to data sets they weren’t trained on. Barbu et al. (2019), for example, report a 40-45% drop in performance on the ObjectNet test set, constructed from real-world examples (though Kolesnikov et al. (2020) recently improved the ObjectNet state of the art by 25%, reaching 80% top-five accuracy).531 See figure below, and endnote, for some other examples.532
    Figure 15: Examples of generalization failures. From Geirhos et al. (2020), Figure 3, p. 8, reprinted with permission, and unaltered. Original caption: “Both human and machine vision generalise, but they generalise very differently. Left: image pairs that belong to the same category for humans, but not for DNNs. Right: image pairs assigned to the same category by a variety of DNNs, but not by humans.”

     

  • The common ILSVRC benchmark involves classifying images from 1000 categories. But humans can plausibly classify objects from more (much more?) than 10,000 categories, including very particular categories like “that one mug” or “the chair from the living room.”533 Indeed, it’s unclear to me, conceptually, how to draw the line between classifying an object (“house,” “dog,” “child”) and thinking/feeling/predicting (“house I’d like to live in,” “dog that I love,” “child in danger”).534 That said, it’s possible that all of these categories draw on similar low-level visual features detected in early stages of processing.
  • The resolution of the human visual system may be finer than the resolution of typical ImageNet images. The optic nerve carries input from roughly 1 million retinal ganglion cells in the retina, and the retina as a whole has about 100 million photoreceptor cells.535 A typical input to an image classifier is 224 × 224 × 3: ~150,000 input values (though some inputs are larger).536
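As a quick sanity check on the input-size gap in this last bullet, here is a back-of-the-envelope sketch (using only the figures just given):

```python
dnn_inputs = 224 * 224 * 3   # typical classifier input values
rgc_outputs = 1e6            # rough retinal ganglion cell count

print(dnn_inputs)                          # 150528, i.e. ~150,000
print(f"{rgc_outputs / dnn_inputs:.0f}x")  # ~7x; the text later rounds this gap to 10x
```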

That said, DNNs may also be superior to the human visual system in some ways. For example, Geirhos et al. (2018) compared DNN and human performance at identifying objects presented for 200 ms, and found that DNNs outperformed humans by >5% classification accuracy on images from the training distribution (humans generally did better overall when the images were altered).537 And human vision is subject to its own illusions, blind spots, shortcuts, etc.538 And I certainly don’t know that many species of dog. Overall, though, the human advantages here seem more impressive to me.

Note, also, that the question here is not whether DNNs are processing visual information exactly like humans do. For example, in order to qualify as human-level, the models don’t need to make the same sorts of mistakes humans do. What matters is high-level task performance.

 

3.2.3 Making up some numbers

Suppose we forge ahead with a very loose functional method estimate, despite these uncertainties. What results?

An EfficientNet-B2, capable of a roughly human-level 95% top-five accuracy on ImageNet classification, takes 1e9 FLOPs for a forward pass – though note that if we assume sparse FLOPs (e.g., no costs for multiplying by or adding 0), as we did for the mechanistic method, this number would be lower;539 and it might be possible to prune/compress the model further (though EfficientNet-B2 is already optimized to minimize FLOPs).540

Humans can recognize ~ten images per second (though the actual process of assigning labels to ImageNet images takes much longer).541 If we ran EfficientNet-B2 ten times per second, this would require ~1e10 FLOP/s.

On one estimate from 1995, V1 in humans has about 3e8 neurons.542 However, based on more recent estimates in chimpanzees, I think this estimate might be low, possibly by an order of magnitude (see endnote for explanation).543 I’ll use 3e8-3e9 – e.g., ~0.3%-3% of the brain’s neurons.

On an initial search, I haven’t been able to find good sources for neuron count in the visual cortex as a whole, which includes areas V2-V5.544 I’ll use 1e9-1e10 neurons – e.g., ~1-10% of the brain’s neurons as a whole – but this is just a ballpark.545

If we focused on percentage of volume, weight, energy consumption, and synapses, the relevant percentages might be larger (since the cortex accounts for a larger percentage of these than of the brain’s neurons).546

We can distill the other uncertainties from 3.2.1 and 3.2.2 into two numbers:

  1. The percentage of its information-processing capacity that the visual cortex devotes to tasks analogous to image classification, when it performs them.
  2. The factor increase in FLOP/s required to reach human-level performance on this task (if any), relative to the FLOP/s costs of an EfficientNet-B2 run 10 times per second.

Absent a specific chunk of the visual cortex devoted exclusively to this task, the percentage in (1) does not have an obvious physiological interpretation in terms of e.g. volume or number of neurons.547 Still, something like percentage of spikes or of signaling-based energy consumption driven by performing the task might be a loose guide.548

Of course, the resources that a brain uses in performing a task are not always indicative of the FLOP/s the task requires. Multiplying two 32-bit numbers in your head, for example, uses lots of spikes, energy, etc., but requires only one FLOP. And naively, it seems unlikely that the neural resources used in playing e.g. Tic-Tac-Toe, Checkers, Chess, and Go will be a simple function of the FLOP/s that have thus far been found necessary to match human-level performance. However, the brain was not optimized to multiply large numbers or play board games. Identifying visual objects (e.g. predators, food) seems like a better test of its computational potential.549

Can we say anything about (1)? Obviously, it’s difficult. The variance in the activity in the visual cortex explained by DNN image classifiers might provide some quantitative anchor (this appears to be at least 7% in V1, and possibly much higher in other regions), but I haven’t explored this much.550 Still, to the extent (1) makes sense at all, it should be macroscopic enough to explain the results discussed at the beginning of this section (e.g., it should make interesting parallels between the feature detection in DNNs and the visual cortex noticeable using tools like fMRI and spike recordings), along with other modeling successes in visual neuroscience I haven’t explored.551 I’ll use 1% of V1 as a low end,552 and 10% of the visual cortex as a whole as a high end, with 1% of the visual cortex as a rough middle (note that, since (1) appears in the denominator, a smaller assumed percentage implies a larger whole-brain estimate).

My biggest hesitation about these numbers comes from the conceptual ambiguities involved in estimating this type of parameter at all. Consider: “what fraction of a horse’s legs does a wheelbarrow automate?”553 It’s not clear that “of course it’s hard to say precisely, but surely at least a millionth, right?” is a sensible answer – and the problem isn’t that the true answer is a billionth instead. It seems possible that comparisons between DNNs and the visual cortex are similar.

We also need to scale up the size of the DNN in question by (2), to reflect the FLOPs costs of fully human-level image classification. What is (2)? I haven’t looked into it much, and I feel very uncertain. Some of the differences discussed in 3.2.2 – for example, differences in input size, or in number of categories (assuming we can settle on a meaningful estimate for the number of categories humans can recognize) – might be relatively easy to adjust for.554 But others, such as the FLOPs required to run models that are only as vulnerable to adversarial examples as humans are, or that can generalize as well as humans can, might require much more involved and difficult extrapolations.

I’m not going to explore these adjustments in detail here. Here are a few possible factors:

  • 10x (150k input values vs. ~1 million retinal ganglion cells)
  • 100x (~factor increase in EfficientNet-B2 FLOPs required to run a BiT-L model, which exhibits better, though still imperfect, generalization to real-world datasets like ObjectNet).555
  • 1000x (10x on top of a BiT-L model, for additional improvements. I basically just pulled this number out of thin air, and it’s by no means an upper bound).

Putting these estimates for (1) and (2) together:

ESTIMATE TYPE | ASSUMED % OF VISUAL CORTEX INFORMATION-PROCESSING CAPACITY USED FOR TASKS ANALOGOUS TO IMAGE CLASSIFICATION, WHEN PERFORMED | IMPLIED % OF THE WHOLE BRAIN’S CAPACITY (BASED ON NEURON COUNT) | ASSUMED FACTOR INCREASE IN 10 HZ EFFICIENTNET-B2 FLOP/S (1E10) REQUIRED TO REACH FULLY HUMAN-LEVEL IMAGE CLASSIFICATION | WHOLE-BRAIN FLOP/S ESTIMATE RESULTING FROM THESE ASSUMPTIONS
Low-end | 10% | 0.1%-1% | 10x | 1e13-1e14
Middle | 1% | 0.01%-0.1% | 100x | 1e15-1e16
High-end | 0.3% (1% of V1) | 0.003%-0.03% | 1000x | 3e16-3e17
Figure 16: Functional method estimates based on the visual cortex.
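To make the arithmetic behind Figure 16 explicit, here is a minimal sketch that reproduces the table’s rows (all parameter values are the rough, made-up numbers from the text, not settled estimates):

```python
# Sketch of the functional method arithmetic behind Figure 16.
BASELINE_FLOPS = 1e10        # EfficientNet-B2 (~1e9 FLOPs/pass) run 10x per second
BRAIN_NEURONS = 1e11         # rough whole-brain neuron count
VC_NEURONS = (1e9, 1e10)     # visual cortex neuron range used above
V1_NEURONS = (3e8, 3e9)      # V1 neuron range used above

def whole_brain_estimate(scale_up, fraction, region_neurons):
    """Scale the baseline by the human-level correction factor (2),
    then divide by the assumed fraction of the brain's capacity (1)."""
    low, high = region_neurons
    frac_low = fraction * low / BRAIN_NEURONS    # fraction of whole brain, low end
    frac_high = fraction * high / BRAIN_NEURONS  # fraction of whole brain, high end
    flops = BASELINE_FLOPS * scale_up
    return flops / frac_high, flops / frac_low

# (scale-up factor, assumed fraction of region, region neuron range)
rows = {
    "Low-end":  (10,   0.10, VC_NEURONS),   # 10% of visual cortex, 10x
    "Middle":   (100,  0.01, VC_NEURONS),   # 1% of visual cortex, 100x
    "High-end": (1000, 0.01, V1_NEURONS),   # 1% of V1, 1000x
}

for name, (scale, frac, region) in rows.items():
    lo, hi = whole_brain_estimate(scale, frac, region)
    print(f"{name}: {lo:.0e}-{hi:.0e} FLOP/s")
# Low-end: 1e+13-1e+14, Middle: 1e+15-1e+16, High-end: 3e+16-3e+17
```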

Obviously, the numbers for (1) and (2) here are very made-up. The question of how high (2) could go, for example, seems very salient. And the conceptual ambiguities involved in comparing what the human visual system is doing when it classifies an image, vs. what a DNN is doing, caution against relying on what might appear to be conservative bounds.

What’s more, glancing at different models, image classification (that is, assigning labels to whole images) appears to require fewer FLOPs than other vision tasks in deep learning, such as object detection (that is, identifying and localizing multiple objects in an image). For example: an EfficientDet-D7, a close to state of the art object-detection model optimized for efficiency, uses 3e11 FLOPs per forward pass – 300x more than an EfficientNet-B2.556 So using this sort of model as a baseline instead could add a few orders of magnitude. And such a choice would raise its own questions about what human-level performance on the relevant task looks like.

Overall, I hold functional method estimates based on current DNN vision models very lightly – even more lightly, for example, than the mechanistic method estimates above. Still, I don’t think them entirely uninformative. For example, it is at least interesting to me that you need to treat an EfficientNet-B2 as running on e.g. ~0.1% of the FLOPs of a model that would cover ~1% of V1, in order to get whole brain estimates substantially above 1e17 FLOP/s – the top end of the mechanistic method range I discussed above. This weakly suggests to me that such a range is not way too low.

3.3 Other functional method estimates

There are various other functional method estimates in the literature. Here are three:557

SOURCE | TASK | ARTIFICIAL SYSTEM | COSTS OF HUMAN-LEVEL PERFORMANCE | ESTIMATED PORTION OF BRAIN | RESULTING ESTIMATE FOR WHOLE BRAIN
Drexler (2019)558 | Speech recognition | DeepSpeech2 | 1e9 FLOP/s | >0.1% | 1e12 FLOP/s
Drexler (2019)559 | Translation | Google Neural Machine Translation | 1e11 FLOP/s (1 sentence per second) | 1% | 1e13 FLOP/s
Kurzweil (2005)560 | Sound localization | Work by Lloyd Watts | 1e11 calculations/s | 0.1% | 1e14 calculations/s
Figure 17: Other functional method estimates in the literature.

I haven’t attempted to vet these estimates. And we can imagine others. Possibly instructive recent work includes:

  • Kell et al. (2018), who suggest that ANNs trained to recognize sounds can predict neural activity in the cortex.561
  • Banino et al. (2018) and Cueva and Wei (2018), who suggest that ANNs trained on navigation tasks develop grid-like representations, akin to grid cells in biological circuits.562
  • Merel et al. (2020), who develop a virtual rodent, which might allow productive comparison with the capabilities of a real rodent.563

That said, I expect other functional method estimates to encounter difficulties analogous to those discussed in section 3.2: e.g., difficulties identifying (a) the percentage of the brain’s capacity devoted to a given task, (b) what human-level performance looks like, and (c) the FLOP/s sufficient to match this level.

4 The limit method

Let’s turn to a third method, which attempts to upper bound required FLOP/s by appealing to physical limits.

Some such bounds are too high to be helpful. Lloyd (2000), for example, calculates that a 1 kg, 1 liter laptop (the brain is roughly 1.5 kg and 1.5 liters) can perform a maximum of 5e50 operations per second, and store a maximum of 1e31 bits. Its memory, though, “looks like a thermonuclear explosion.”564 For present purposes, such idealizations aren’t informative.

Other physical limits, though, might be more so. I’ll focus on “Landauer’s principle,” which specifies the minimum energy costs of erasing bits (more description below). Standard FLOPs (that is, the FLOPs performed by human-engineered computers) erase bits, which means that an idealized computer running on the brain’s energy budget (~20W) can only perform so many standard FLOP/s: specifically, ~7e21 (~1e21 if we assume 8-bit FLOPs, and ~1e19 if we assume current digital multiplier implementations).565

Does this upper bound the FLOP/s required to match the brain’s task-performance? In principle, no. The brain need not be performing operations that resemble standard FLOPs, and more generally, bit-erasures are not a universal currency of computational complexity.566 In theory, for example, factorizing a semiprime requires no bit-erasures, since the mapping from inputs to outputs is 1-1.567 But we’d need many FLOPs to do it. Indeed, in principle, it appears possible to perform arbitrarily complicated computations with very few bit erasures, with manageable algorithmic overheads (though there is at least some ongoing controversy about this).568

Absent a simple upper bound, then, the question is what we can say about the following quantity:

FLOP/s required to match the brain’s task performance ÷ bit-erasures/s in the brain

Various experts I spoke to about the limit method (though not all569) thought it likely that this quantity is less than 1 – indeed, multiple orders of magnitude less.570 They gave various arguments, which I’ll roughly group into (a) algorithmic arguments (Section 4.2.1), and (b) hardware arguments (Section 4.2.2). Of these, the hardware arguments seem to me stronger, but they also don’t seem to me to rely very directly on Landauer’s principle in particular.

Whether the bound in question emerges primarily from Landauer’s principle or not, though, I’m inclined to defer to the judgment of these experts overall.571 And even if their arguments do not treat the brain entirely as a black box, a number of the considerations these arguments appeal to seem to apply in scenarios where more specific assumptions employed by other methods are incorrect. This makes them an independent source of evidence.

Note, as well, that e.g. 1e21 FLOP/s isn’t too far from some of the numbers that have come up in previous sections. And some experts either take numbers in this range or higher seriously, or are agnostic about them.572 In this sense, the bound in question, if sound, would provide an informative constraint.

 

4.1 Bit-erasures in the brain

4.1.1 Landauer’s principle

Landauer’s principle says that implementing a computation that erases information requires transferring energy to the environment – in particular, k × T × ln2 per bit erased, where k is Boltzmann’s constant, and T is the absolute temperature of the environment.573

I’ll define a computation, here, as a mapping from input logical states to probability distributions over output logical states, where logical states are sets of physical microstates treated as equivalent for computational purposes;574 and I’ll use “operation” to refer to a comparatively basic computation implemented as part of implementing another computation. Landauer’s principle emerges from the close relationship between changes in logical entropy (understood as the Shannon entropy of the probability distribution over logical states) and thermodynamic entropy (understood as the natural logarithm of the number of possible microstates, multiplied by Boltzmann’s constant).575

In particular, if (given an initial probability distribution over inputs) a computation involves decreasing logical entropy (call a one bit decrease a “logical bit-erasure”),576 then implementing this computation repeatedly using a finite physical system (e.g., a computer) eventually requires increasing the thermodynamic entropy of the computer’s environment – otherwise, the total thermodynamic entropy of the computer and the environment in combination will decrease, in violation of the second law of thermodynamics.577

Landauer’s principle quantifies the energy costs of this increase.578 These costs arise from the relationship between the energy and the thermodynamic entropy of a system: broadly, if a system’s energy increases, it can be in more microstates, and hence its entropy increases.579 Temperature, fundamentally, is defined by this exchange rate.580

There has been some controversy over Landauer’s principle,581 and some of the relevant physics has been worked out more rigorously since Landauer’s original paper.582 But the basic thrust emerges from very fundamental physics, and my understanding is that it’s widely accepted by experts.583 A number of recent results also purport to have validated Landauer’s principle empirically.584

4.1.2 Overall bit-erasures

Let’s assume that Landauer’s principle caps the bit-erasures the brain can implement. What bit-erasure budget does this imply?

Most estimates I’ve seen of the brain’s energy budget vary between ~10-20W (Joules/second).585 But not all of this energy goes to computation:

  • Loose estimates suggest that 40% of energy use in the brain,586 and 25% in cortical gray matter,587 goes towards non-signaling tasks.588
  • Some signaling energy is plausibly used for moving information from one place to another, rather than computing with it. Harris and Attwell (2012), for example, estimate that action potentials use 17% of the energy in grey matter (though much less in white matter).589

That said, these don’t initially appear to be order-of-magnitude level adjustments. I’ll use 20W as a high end.

The brain operates at roughly 310 Kelvin, as does the body.590 Even if the air surrounding the body is colder, Dr. Jess Riedel suggested that it’s the temperature of the skull and blood that’s relevant, as the brain has to push entropy into the environment via these conduits.591

At 310 K, k × T × ln2 Joules results in a minimum energy emission of 3e-21 Joules per bit erasure.592 With a 20W budget, this allows no more than 7e21 bit erasures per second in the brain overall.593 This simple estimate passes over some complexities (see endnote), but I’ll use it as a first pass.594
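For concreteness, here is a minimal sketch of this calculation (constants and figures as given above):

```python
import math

k = 1.380649e-23   # Boltzmann's constant, J/K
T = 310            # rough brain/body temperature, K
POWER = 20         # high-end brain energy budget, W (J/s)

landauer_j_per_bit = k * T * math.log(2)        # minimum J dissipated per bit erased
max_erasures_per_s = POWER / landauer_j_per_bit

print(f"{landauer_j_per_bit:.1e} J per bit-erasure")   # ~3.0e-21
print(f"{max_erasures_per_s:.1e} bit-erasures/s max")  # ~6.7e+21, i.e. ~7e21
```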

4.2 From bit-erasures to FLOP/s

Can we get from this to a bound on required FLOP/s?

If the brain were performing standard FLOPs, it would be easy. A standard FLOP takes two n-bit numbers, and produces another n-bit number. So absent active steps to save the inputs, you’ve erased at least n bits.595 7e21 bit-erasures/s, then, would imply a maximum of e.g. ~2e21 4-bit FLOP/s, 9e20 8-bit FLOP/s, and so forth, for a computer running on 20W at 310 Kelvin.

And the intermediate steps involved in transforming inputs into outputs erase bits as well. For example, Hänninen et al. (2011) suggest that on current digital multiplier implementations, the most efficient form of n-bit multiplication requires 8 × n² bit-erasures – e.g., 128 for a 4-bit multiplication, and 512 for an 8-bit multiplication.596 This would suggest a maximum of ~5e19 4-bit digital multiplications per second, and ~1e19 8-bit multiplications per second (though analog implementations may be much more efficient).597
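These operation-rate ceilings fall directly out of dividing the bit-erasure budget by the assumed erasures per operation; a sketch, using the Hänninen et al. figure for digital multipliers:

```python
# Operation-rate ceilings implied by a ~7e21 bit-erasures/s budget.
BUDGET = 7e21  # bit-erasures/s, from the Landauer sketch above (rounded)

for n in (4, 8):
    # A standard n-bit FLOP maps two n-bit inputs to one n-bit output,
    # erasing at least n bits if the inputs aren't saved.
    flops = BUDGET / n
    # Hänninen et al. (2011): efficient digital multipliers erase ~8 * n^2 bits.
    mults = BUDGET / (8 * n**2)
    print(f"{n}-bit: {flops:.0e} FLOP/s, {mults:.0e} digital multiplies/s")
# 4-bit: 2e+21 FLOP/s, 5e+19 digital multiplies/s
# 8-bit: 9e+20 FLOP/s, 1e+19 digital multiplies/s
```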

And FLOPs in actual digital computers appear to erase even more bits than this – ~1 bit-erasure per transistor switch involved in the operation.598 Sarpeshkar (1998) suggests 3000 transistors for an 8-bit digital multiply (though only 4-8 for analog implementations);599 Asadi and Navi (2007) suggest >20,000 for a 32-bit multiply.600

Perhaps for some, comfortable assuming that the brain’s operations are relevantly like standard FLOPs, this is enough. But a robust upper bound should not assume this. The brain implements some causal structure that allows it to perform tasks, which can in principle be replicated using FLOP/s, but which itself could in principle take a wide variety of unfamiliar forms. Landauer’s principle tells us that this causal structure, represented as a set of (possibly stochastic) transitions between logical states, cannot involve erasing more than 7e21 bits/second.601 It doesn’t tell us anything, directly, about the FLOP/s required to replicate the relevant transitions, and/or perform the relevant tasks.602

Here’s an analogy. Suppose that you’re wondering how many bricks you need to build a bridge across the local river, and you know that a single brick always requires a pound of mortar. You learn that the “old bridge” across the river was built using no more than 100,000 pounds of mortar. If the old bridge is made of bricks, then you can infer that 100,000 bricks is enough. If the old bridge is made of steel, though, you can’t: even assuming that a brick can do anything y units of steel can do, y units of steel might require less (maybe much less) than a pound of mortar, so the old bridge could still be built with more than 100,000×y units of steel.

Obviously, the connection between FLOPs, bit-erasures, and the brain’s operations may be tighter than that between bricks, mortar, and steel. But conceptually, the point stands: unless we assume that the brain performs standard FLOPs, moving from bit-erasures to FLOPs requires further arguments. I’ll consider two types.

 

4.2.1 Algorithmic arguments

We might think that any algorithm useful for information-processing, whether implemented using standard FLOPs or not, will require erasing lots of logical bits.

In theory, this appears to be false (though there is at least some ongoing controversy, related to the bit-erasures implied by repeatedly reading/writing inputs and outputs).603 Any computation can be performed using logically reversible operations (that is, operations that allow you to reconstruct the input on the basis of the output), which do not erase bits.604 For example, in theory, you can make multiplication reversible just by saving one of the inputs.605 And my understanding is that the algorithmic overheads involved in using logically reversible operations, instead of logically irreversible ones – e.g., additional memory to save intermediate results, additional processing time to “rewind” computations606 – are fairly manageable, something like a small multiplicative factor in running time and circuit size.607
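As a toy illustration of the multiplication example: the mapping below saves one input alongside the product, which makes it invertible (this sketch assumes nonzero integer inputs – an assumption of mine, not the text), so in principle no information need be destroyed:

```python
def reversible_mul(a: int, b: int):
    # Keep one input alongside the product; nothing is erased.
    return (a, a * b)

def reversible_mul_inverse(a: int, ab: int):
    # Recover the original inputs from the output.
    assert a != 0 and ab % a == 0
    return (a, ab // a)

assert reversible_mul_inverse(*reversible_mul(6, 7)) == (6, 7)
```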

In practice, however, two experts I spoke with expected the brain’s information-processing to involve lots of logical bit-erasures. Reasons included:

  • When humans write software to perform tasks, it erases lots of bits.608
  • Dr. Jess Riedel suggested that processing sensory data requires extracting answers to high-level questions (e.g., “should I dodge this flying rock to the left or the right?”) from very complex intermediate systems (e.g., trillions of photons hitting the eye), which involves throwing out a lot of information.609
  • Prof. Jared Kaplan noted that FLOPs erase bits, and in general, he expects order one bit-erasures per operation in computational systems. You generally don’t do a lot of complicated things with a single bit before erasing it (though there are some exceptions to this). His intuition about this was informed by his understanding of simple operations you can do with small amounts of information.610

If one imagines erasing lots of bits as the “default,” then you can also argue that the brain would need to be unrealistically energy-efficient (see next section) in order to justify any overheads incurred by transitioning to more reversible forms of computation.611 Dr. Paul Christiano noted, though, that if evolution had access to computational mechanisms capable of implementing useful, logically-reversible operations, brains may have evolved a reliance on them from the start.612

We can also look at models of neural computation to see what bit-erasures they imply. There is some risk, here, of rendering the limit method uninformative (e.g., if you’ve already decided how the brain computes, you can just estimate required FLOP/s directly).613 But it could still be helpful. For example:

  • Some kinds of logical irreversibility may apply to large swaths of hypotheses about how the brain computes (e.g., hypotheses on which the membrane potential, which is routinely reset, carries task-relevant information).
  • Some specific hypotheses (e.g., each neuron is equivalent to X-type of very large neural network) might imply bit-erasures incompatible with Landauer’s bound.
  • If the brain is erasing lots of bits in one context, this might indicate that it does so elsewhere too, or everywhere.

Of course, it’s a further step from “the brain is probably erasing lots of logical bits” to “FLOP/s required to replicate the brain’s task-performance ÷ bit-erasures per second in the brain ≤1,” just as it’s a further step from “the old bridge was probably built using lots of mortar” to “bricks I’ll need ÷ pounds of mortar used for the old bridge ≤1.” One needs claims like:

  1. A minimal, computationally useful operation in the brain probably erases at least one logical bit, on average.
  2. One FLOP is probably enough to capture what matters about such an operation, on average.

Prof. Kaplan and Dr. Riedel both seemed to expect something like (1) and (2) to be true, and they seem fairly plausible to me as well. But the positive algorithmic arguments just listed don’t themselves seem to me obviously decisive.

4.2.2 Hardware arguments

Another class of arguments appeals to the energy dissipated by the brain’s computational mechanisms. After all, the brain’s ~20W budget constrains dissipation directly: if each FLOP’s worth of useful computational work in the brain costs at least ~0.69kT of dissipation on average, then the ~7e21 FLOP/s bound goes through whether or not the brain erases many logical bits.

For example, in combination with (2) above, we might argue instead for:

1*. A minimal, computationally useful operation in the brain probably dissipates at least 0.69kT, on average.

One possibly instructive comparison is with the field of reversible computing, which aspires to build computers that dissipate arbitrarily small amounts of energy per operation.614 This requires logically reversible algorithms (since otherwise, Landauer’s principle will set a minimum energy cost per operation), but it also requires extremely non-dissipative hardware – indeed, hardware that is close to thermodynamically reversible (e.g., its operation creates negligible amounts of overall thermodynamic entropy).

Useful, scalable hardware of this kind would need to be really fancy. As Dr. Michael Frank puts it, it would require “a level of device engineering that’s so precise and sophisticated that it will make today’s top-of-the-line device technologies seem as crude in comparison, to future eyes, as the practice of chiseling stone tablets looks to us today.”615 According to Dr. Frank, the biggest current challenge centers on the trade-off between energy dissipation and processing speed.616 Dr. Christiano also mentioned challenges imposed by an inability to expend energy in order to actively set relevant physical variables into particular states: the computation needs to work for whatever state different physical variables happen to end up in.617

For context, the energy dissipation per logical bit-erasure in current digital computers appears to be ~1e5-1e6 worse than Landauer’s limit, and progress is expected to asymptote between 1e3 and 1e5.618 A V100 GPU, at 1e14 FLOP/s and 300W, dissipates ~1e9 × 0.69kT per FLOP (assuming room temperature).619 So in order to perform the logically-reversible equivalent of a FLOP for less than 0.69kT, you’d need a roughly billion-fold increase in energy efficiency.
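A rough version of this arithmetic (a sketch; room temperature taken as ~300 K):

```python
import math

k = 1.380649e-23                      # Boltzmann's constant, J/K
kT_ln2_300K = k * 300 * math.log(2)   # ~2.9e-21 J at room temperature

v100_j_per_flop = 300 / 1e14          # 300 W spread over 1e14 FLOP/s
print(f"{v100_j_per_flop / kT_ln2_300K:.0e}")  # ~1e+09, i.e. ~1e9 x 0.69kT per FLOP
```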

Of course, biological systems have strong incentives to reduce energy costs.620 And some computational processes in biology are extremely efficient.621 But relative to a standard of 0.69kT per operation, the brain’s mechanisms generally appear highly dissipative.622 For example:

  • Laughlin et al. (1998) suggest that synapses and cells use ~1e5-1e8kT per bit “observed” (though I don’t have a clear sense of what the relevant notion of observation implies).623
  • A typical cortical spike dissipates around 1e10-1e11kT.624 Prof. David Wolpert noted that this process involves very complicated physical machinery, which he expects to be very far from theoretical limits of efficiency, being used to propagate a single bit.625
  • Dr. Riedel mentioned that the nerves conveying a signal to kick your leg burn much more than 0.69kT per bit required to say how much to move the muscle.626
  • A single molecule of ATP (the brain’s main energy currency) releases ~25kT,627 and Dr. Christiano was very confident that the brain would need at least 10 ATPs to get computational mileage equivalent to a FLOP.628 At a rough maximum of ~2e20 ATPs per second,629 this would suggest <2e19 FLOP/s.
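The ATP arithmetic in this last bullet is simple enough to spell out (a sketch using only the figures above):

```python
k = 1.380649e-23
kT_310 = k * 310                 # ~4.3e-21 J at body temperature
atp_j = 25 * kT_310              # ~25kT released per ATP
atp_per_s = 20 / atp_j           # ATPs/s supportable on a 20W budget

print(f"{atp_per_s:.0e} ATP/s")            # ~2e+20
print(f"{atp_per_s / 10:.0e} FLOP/s max")  # ~2e+19, i.e. <2e19 at 10 ATPs per FLOP
```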

Of course, the relevant highly-non-dissipative information-processing could be hiding somewhere we can’t see, and/or occurring in a way we don’t understand. But various experts also mentioned more general features of the brain that make it poorly suited to this, including:

  • The size of its components.630
  • Its warm temperature.631
  • The need to boost signals in order to contend with classical noise.632
  • Its reliance on diffusion to propagate information.633
  • The extreme difficulty of building reversible computers in general.634

All of this seems to me like fairly strong evidence for something like 1*.

Note, though, that Landauer’s principle isn’t playing a very direct role here. We had intended to proceed from an estimate of the brain’s energy budget, to an upper bound on its logical bit-erasures (via Landauer’s principle), to an upper bound on the FLOP/s required to match its task performance. But hardware arguments skip the middle step, and just argue directly that you don’t need more than one FLOP per 0.69kT used by the brain. I think that this is probably true, but absent this middle step, 0.69kT doesn’t seem like a clearly privileged number to focus on.

4.3 Overall weight for the limit method

Overall, it seems very unlikely to me that more than ~7e21 FLOP/s is required to match the brain’s task-performance. This is centrally because various experts I spoke to seemed confident about claims in the vicinity of (1), (1*), and (2) above; partly because those claims seem plausible to me as well; and partly because other methods generally seem to point to lower numbers.635

Indeed, lower numbers (e.g., ~1e21, the maximum 8-bit irreversible FLOP/s a computer running on 20W at 310 Kelvin could perform; and ~1e20, the maximum required FLOP/s assuming at least one ATP per required FLOP) seem likely to me to be overkill as well.636

That said, this doesn’t seem like a case of a hard physical limit imposing a clean upper bound. Even equipped with an application of the relevant limit to the brain (various aspects of this still confuse me – see endnote), further argument is required.637 And indeed, the arguments that seem most persuasive to me (e.g., hardware arguments) don’t seem to rely very directly on the limit itself. Still, we should take whatever evidence we can get.

 

5 The communication method

Let’s briefly discuss a final method (the “communication method”), which attempts to use the communication bandwidth in the brain as evidence about its computational capacity. I haven’t explored this much, but I think it might well be worth exploring.

Communication bandwidth, here, refers to the speed with which a computational system can send different amounts of information different distances.638 This is distinct from the operations per second that a system can perform (computation), but it’s just as hard a constraint on what the system can do.

Estimating the communication bandwidth in the brain is a worthy project in its own right. But it also might help with computation estimates. This is partly because the marginal values of additional computation and communication are related (e.g., too little communication and your computational units sit idle; too few computational units and it becomes less useful to move information around).

Can we turn this into a FLOP/s estimate? The basic form of the argument would be roughly:

  1. The profile of communication bandwidth in the brain is X.
  2. If the profile of the communication bandwidth in the brain is X, then Y FLOP/s is probably enough to match its task performance.

I’ll discuss each premise in turn.

 

5.1 Communication in the brain

One approach to estimating communication in the brain would be to identify all of the mechanisms involved in it, together with the rates at which they can send different amounts of information different distances.

  • Axons are clearly a central mechanism here, and one in which a sizeable portion of the brain’s energy and volume have been invested.639 There is a large literature on estimating the information communicated by action potentials.640
  • Dendrites also seem important, though generally at shorter distances (and at sufficiently short distances, distinctions between communication and computation may blur).641
  • Other mechanisms (e.g. glia, neuromodulation, ephaptic effects, blood flow – I’m less sure about gap junctions) are plausibly low-bandwidth relative to axons and dendrites.642 If so, this would simplify the estimate. And the resources invested in axons and dendrites would make it seem somewhat strange if the brain has other, superior forms of communication available.643

Dr. Paul Christiano suggests a rough estimate of ~10 bits per spike for axon communication, and uses this to generate the bounds of ~1e9 bytes/s of long-distance communication across the brain, 1e11 bytes/s of short-distance communication (where each neuron could access ~1e7 nearby neurons), and larger amounts of very short-distance communication.644
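I haven’t seen this derivation spelled out, but one way to roughly rationalize the short-distance figure – under parameter choices that are my assumptions, not Dr. Christiano’s – is to multiply a whole-brain neuron count by an average firing rate and a per-spike bit rate:

```python
# Illustrative only: the neuron count and firing rate here are my
# assumptions, not Dr. Christiano's actual derivation.
NEURONS = 1e11         # rough whole-brain neuron count
AVG_FIRING_HZ = 1.0    # rough average firing rate
BITS_PER_SPIKE = 10    # Dr. Christiano's rough per-spike estimate

bytes_per_s = NEURONS * AVG_FIRING_HZ * BITS_PER_SPIKE / 8
print(f"{bytes_per_s:.0e} bytes/s")  # ~1e+11, the short-distance figure
```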

Another approach would be to draw analogies with metrics used to assess the communication capabilities of human computers. AI Impacts, for example, recommends the traversed edges per second (TEPS) metric, which measures the time required to perform a certain kind of search through a random graph.645 They treat neurons as vertices on the graph, synapses as edges, and spikes through synapses as traversals of edges, yielding an overall estimate of ~2e13-6e14 TEPS (the same as their estimate of the number of spikes through synapses).646

I haven’t investigated either of these estimates in detail. But they’re instructive examples.

 

5.2 From communication to FLOP/s

How do we move from a communication profile for the brain, to an estimate of the FLOP/s sufficient to match its task performance? There are a number of possibilities.

One simple argument runs as follows: if you have two computers comparable on one dimension important to performance (e.g., communication), but you can’t measure how they compare on some other dimension (e.g., computation), then other things equal, your median guess should be that they are comparable on this other dimension as well.647 Here, the assumption would be that the known dimension reflects the overall skill of the engineer, which was presumably applied to the unknown dimension as well.648 As an analogy: if all we know is that Bob’s cheesecake crusts are about as good as Maria’s, the best median guess is that they’re comparable cheesecake chefs, and hence that his cheesecake filling is about as good as hers as well.

Of course, we know much about brains and computers unrelated to how their communication compares. But for those drawn to simple a priori arguments, perhaps this sort of approach can be useful.

Using Dr. Christiano’s estimates, discussed above, one can imagine comparing a V100 GPU to the brain as follows:649

METRIC | V100 | HUMAN BRAIN
Short-distance communication | 1e12 bytes/s of memory bandwidth | 1e11 bytes/s to nearby neurons? (not vetted)650
Long-distance communication | 3e11 bytes/s of off-chip bandwidth | 1e9 bytes/s across the brain? (not vetted)651
Computation | 1e14 FLOP/s | ?
Figure 18: Comparing the brain to a V100.

On these estimates, the V100’s communication is at least comparable to the brain’s (indeed, it’s superior by between 10 and 300x). Naively, then, perhaps its computation is comparable (indeed, superior) as well.652 This would suggest 1e14 FLOP/s or less for the brain.

That said, it seems like a full version of this argument would include other available modes of comparison as well (continuing the analogy above: if you also know that Maria’s jelly cheesecake toppings are much worse than Bob’s, you should take this into account too). For example, if we assume that synapse weights are the central means of storing memory in the brain,653 we might get:

METRIC | V100 | HUMAN BRAIN
Memory | 3e10 bytes on chip | 1e14-1e15 synapses,654 each storing >5 bits?655
Power consumption | 300W | 20W656
Figure 19: Comparing the brain to a V100, continued.

So the overall comparison here becomes more complicated. V100 power consumption is >10x worse, and comparable memory, on this naive memory estimate for the brain, would require a cluster of ~3000-30,000 V100s, suggesting a corresponding increase to the FLOP/s attributed to the brain (memory access across the cluster would become more complex as well, and overall energy costs would increase).657
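The cluster sizing here is just the naive brain-memory estimate divided by per-GPU memory; a sketch, treating the “>5 bits” per synapse as roughly one byte (my assumption, which recovers the ~3000-30,000 range):

```python
SYNAPSES = (1e14, 1e15)   # synapse count range cited above
BYTES_PER_SYNAPSE = 1     # treating ">5 bits" per synapse as roughly one byte
V100_BYTES = 3e10         # V100 on-chip memory

for s in SYNAPSES:
    print(f"{s * BYTES_PER_SYNAPSE / V100_BYTES:,.0f} V100s")
# ~3,333 and ~33,333 - i.e., the ~3000-30,000 range above
```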

A related approach involves attempting to identify a systematic relationship between communication and computation in human computers – a relationship that might reflect trade-offs and constraints applicable to the brain as well.658 Thus, for example, AI Impacts examines the ratio of TEPS to FLOP/s in eight top supercomputers, and finds a fairly consistent ~500-600 FLOP/s per TEPS.659 Scaling up from their TEPS estimate for the brain, they get ~1e16-3e17 FLOP/s.660
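The extrapolation itself is a one-liner; a sketch using AI Impacts’ figures as quoted above:

```python
BRAIN_TEPS = (2e13, 6e14)    # AI Impacts' brain TEPS estimate
FLOP_PER_TEPS = (500, 600)   # ratio observed in top supercomputers

low = BRAIN_TEPS[0] * FLOP_PER_TEPS[0]
high = BRAIN_TEPS[1] * FLOP_PER_TEPS[1]
print(f"{low:.0e} - {high:.0e} FLOP/s")  # 1e+16 - 4e+17 (quoted as ~1e16-3e17)
```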

A more sophisticated version of this approach would involve specifying a production function governing the returns on investment in marginal communication vs. computation.661 This function might allow evaluation of different hypothesized combinations of communication and computation in the brain. Thus, for example, the hypothesis that the brain performs the equivalent of 1e20 FLOP/s, but has the communication profile listed in the table above, might face the objection that it assigns apparently sub-optimal design choices to evolution: e.g., in such a world, the brain would have been better served re-allocating resources invested in computation (energy, volume, etc.) to communication instead.

And even if the brain were performing the equivalent of 1e20 FLOP/s (perhaps because it has access to some very efficient means of computing), such a production function might also indicate a lower FLOP/s budget sufficient, in combination with more communication than the brain can mobilize, to match the brain’s task performance overall (since there may be diminishing returns to more computation, given a fixed amount of communication).662

These are all just initial gestures at possible approaches, and efforts in this vein face a number of issues and objections, including:

  • Variation in optimal trade-offs between communication and computation across tasks.
  • Changes over time to the ratio of communication to computation in human-engineered computers.663
  • Differences in the constraints and trade-offs faced by human designers and evolution.

I haven’t investigated the estimates above very much, so I don’t put much weight on them. But I think approaches in this vicinity may well be helpful.

6 Conclusion

I’ve discussed four different methods of generating FLOP/s budgets big enough to perform tasks as well as the human brain. Here’s a summary of the main estimates, along with the evidence/evaluation discussed:

  • Mechanistic method low (~1e13-1e15 FLOP/s): ~1 FLOP per spike through synapse; neuron models with costs ≤ Izhikevich spiking models run with 1 ms time-steps. Evidence/evaluation: Simple model, and the default in the literature; some arguments suggest that models in this vein could be made adequate for task-performance without major increases in FLOP/s; these arguments are far from conclusive, but they seem plausible to me, and to some experts (others are more skeptical).
  • Mechanistic method high (~1e15-1e17 FLOP/s): ~100 FLOPs per spike through synapse; neuron models with costs greater than Izhikevich models run with 1 ms time-steps, but less than single-compartment Hodgkin-Huxley run with 0.1 ms time-steps. Evidence/evaluation: It also seems plausible to me that FLOP/s budgets for a fairly brain-like task-functional model would need to push into this range in order to cover e.g. learning, synaptic conductances, and dendritic computation (learning seems like an especially salient candidate here).
  • Mechanistic method very high (>1e17 FLOP/s): Budgets suggested by more complex models – e.g., detailed biophysical models, large DNN neuron models, very FLOPs-intensive learning rules. Evidence/evaluation: I don’t see much strong positive evidence that you need this much, even for fairly brain-like models, but it’s possible, and might be suggested by higher temporal resolutions, FLOP/s-intensive DNN models of neuron behavior, estimates based on time-steps per variable, greater biophysical detail, larger FLOPs budgets for processes like dendritic computation/learning, and/or higher estimates of parameters like firing rate or synapse count.
  • Scaling up the DNN from Beniaguev et al. (2020) (~1e21 FLOP/s): Example of an estimate >1e17 FLOP/s. Uses the FLOP/s for a DNN-reduction of a detailed biophysical model of a cortical neuron, scaled up by 1e11 neurons. Evidence/evaluation: I think that this is an interesting example of positive evidence for very high mechanistic method estimates, as Beniaguev et al. (2020) found it necessary to use a very large model in order to get a good fit. But I don’t give this result on its own a lot of weight, partly because their model focuses on predicting membrane potential and individual spikes very precisely, and smaller models may prove adequate on further investigation.
  • Mechanistic method very low (<1e13 FLOP/s): Models that don’t attempt to model every individual neuron/synapse. Evidence/evaluation: It seems plausible to me that something in this range is enough, even for fairly brain-like models. Neurons display noise, redundancy, and low-dimensional behavior that suggest that modeling individual neurons/synapses might be overkill; mechanistic method estimates based on low-level components (e.g. transistors) substantially overestimate FLOP/s capacity in computers we actually understand; emulation imposes overheads; and the brain’s design reflects evolutionary constraints that could allow further simplification.
  • Functional method estimate based on Moravec’s retina estimate, scaled up to the whole brain (~1e12-1e15 FLOP/s, assuming 1 calculation ~= 1 FLOP): Assumes 1e9 calculations per second for the retina (100 calculations per edge/motion detection, 10 edge/motion detections per second per cell, 1e6 cells); scaled up by 1e3-1e6 (the range suggested by portion of mass, volume, neurons, synapses, and energy). Evidence/evaluation: The retina does a lot of things other than edge and motion detection (e.g., it anticipates motion, it can signal that a predicted stimulus is absent, it can adapt to different lighting conditions, it can suppress vision during saccades); and there are lots of differences between the retina and the brain as a whole. But the estimate, while incomplete in its coverage of retinal function, might be instructive regardless, as a ballpark for some central retinal operations (I haven’t vetted the numbers Moravec uses for edge/motion detection, but Prof. Barak Pearlmutter expected them to be accurate).664
  • Functional method estimate based on DNN models of the retina, scaled up to the whole brain (~1e16-1e20 FLOP/s): Estimates of retina FLOP/s implied by the models in Batty et al. (2017) (1e14 FLOP/s) and Maheswaranathan et al. (2019) (1e13 FLOP/s), scaled up to the brain as a whole using the same 1e3-1e6 range above. Evidence/evaluation: I think this is some weak evidence for numbers higher than 1e17, and the models themselves are still far from full replications of retinal computation. However, I’m very uncertain about what it looks like to scale these models up to the retina as a whole. And it also seems plausible to me that these models use many more FLOP/s than required to do what the retina does. For example, their costs reflect implementation choices and model sizes that haven’t yet been shown necessary, and Moravec’s estimate (even if incomplete) is much lower.
  • Low-end functional method estimate based on the visual cortex (~1e13-1e14 FLOP/s): Treats a 10 Hz EfficientNet-B2 image classifier, scaled up by 10x, as equivalent to 10% of the visual cortex’s information-processing capacity, then scales up to the whole brain based on portion of neurons (portion of synapses, volume, mass, and energy consumption might be larger, if the majority of these are in the cortex). Evidence/evaluation (this applies to the middle-range and high-end visual cortex estimates below as well): In general, I hold these estimates lightly, as I feel very uncertain about what the visual cortex is doing overall and how to compare it to DNN image classifiers, as well as about the scale-up in model size that will be required to reach image classification performance as generalizable across data sets and robust to adversarial examples as human performance is (the high-end correction for this used here – 1000x – is basically just pulled out of thin air, and could be too low). That said, I do think that, to the extent it makes sense at all to estimate the % of the visual cortex’s information-processing capacity mobilized in performing a task analogous to image classification, the number should be macroscopic enough to explain the interesting parallels between the feature detection in image classifiers and in the visual cortex (see Section 3.2 for discussion). 1% of V1 seems to me reasonably conservative in this regard, especially given that CNNs trained on image classification end up as state of the art predictors of neural activity in V1 (as well as elsewhere in the visual cortex). So I take these estimates as some weak evidence that the mechanistic method estimates I take most seriously (e.g., 1e13-1e17) aren’t way too low.
  • Middle-range functional method estimate based on the visual cortex (~1e15-1e16 FLOP/s): Same as previous, but scales up the 10 Hz EfficientNet-B2 by 100x, and treats it as equivalent to 1% of the visual cortex’s information-processing capacity.
  • High-end functional method estimate based on the visual cortex (~3e16-3e17 FLOP/s): Same as previous, but scales up the 10 Hz EfficientNet-B2 by 1000x instead, and treats it as equivalent to 1% of V1’s information-processing capacity.
  • Limit method low end (~1e19 FLOP/s): Maximum 8-bit, irreversible FLOP/s that a computer running on 20W at body temperature can perform, assuming current digital multiplier implementations (~500 bit-erasures per 8-bit multiply). Evidence/evaluation (this applies to the other limit method estimates below as well): I don’t think that a robust version of the limit method should assume that the brain’s operations are analogous to standard, irreversible FLOP/s (and especially not FLOP/s in digital computers, given that there may be more energy-efficient analog implementations available – see Sarpeshkar (1998)). But it does seem broadly plausible to me that a minimal, computationally useful operation in the brain erases at least one logical bit, and very plausible that it dissipates at least 0.69kT (indeed, my best guess would be that it dissipates much more than that, given that cortical spikes dissipate 1e10-1e11kT; a single ATP releases ~25kT; the brain is noisy, warm, and reliant on comparatively large components, etc.). And it seems plausible, as well, that a FLOP is enough to replicate the equivalent of a minimal, computationally useful operation in the brain. Various experts (though not all) also seemed quite confident about claims in this vicinity. So overall, I do think it very unlikely that required FLOP/s exceeds e.g. 1e21. However, I don’t think this is a case of a physical limit imposing a clean upper bound. Rather, it seems like one set of arguments amongst others. Indeed, the arguments that seem strongest to me (e.g., arguments that appeal to the energy dissipated by the brain’s mechanisms) don’t seem to rely directly on Landauer’s principle at all.
  • Limit method middle (~1e21 FLOP/s): Maximum 8-bit, irreversible FLOP/s that a computer running on 20W at body temperature can perform, assuming no intermediate bit-erasures (just a transformation from two n-bit inputs to one n-bit output).
  • Limit method high (~7e21 FLOP/s): Maximum FLOP/s, assuming at least one logical bit-erasure, or at least 0.69kT dissipation, per required FLOP.
  • ATPs (~1e20 FLOP/s): Maximum FLOP/s, assuming at least one ATP used per required FLOP.
  • Communication method estimate based on comparison with V100 (≤1e14 FLOP/s): Estimates brain communication capacity, compares it to a V100, and infers, on the basis of the comparability/inferiority of the brain’s communication to a V100’s communication, that perhaps its computational capacity is comparable/inferior as well. Evidence/evaluation (this applies to the TEPS-based estimate below as well): I haven’t vetted these estimates much and so don’t put much weight on them. The main general question is whether the relationship between communication and computation in human-engineered computers provides much evidence about what to expect that relationship to be in the brain. Initial objections to comparisons to a V100, even granting the communication estimates for the brain that it’s based on, might center on complications introduced by also including memory and energy consumption in the comparison. Initial objections to relying on TEPS-FLOP/s ratios might involve the possibility that there are meaningfully more relevant “edges” in the brain than synapses, and/or “vertices” than neurons. Still, I think that approaches in this broad vicinity may well prove helpful on further investigation.
  • Communication method estimate based on TEPS to FLOP/s extrapolation (~1e16-3e17 FLOP/s): Estimates brain TEPS via an analogy between spikes through synapses and traversals of an edge in a graph; then extrapolates to FLOP/s based on the observed relationship between TEPS and FLOP/s in a small number of human-engineered computers.
Figure 20: Summary and description of the main estimates discussed in the report.

Here are the main numbers plotted together:

 

Figure 1, repeated. The report’s main estimates.

 

None of these numbers are direct estimates of the minimum possible FLOP/s budget. Rather, they are different attempts to use the brain – the only physical system we know of that performs these tasks, but far from the only possible such system – to generate some kind of adequately (but not arbitrarily) large budget. If a given method is successful, it shows that a given number of FLOP/s is enough, and hence, that the minimum is less than that. But it doesn’t, on its own, indicate how much less.

Can we do anything to estimate the minimum directly, perhaps by including some sort of adjustment to one or more of these numbers? Maybe, but it’s a can of worms that I don’t want to open here, as addressing the question of where we should expect the theoretical limits of algorithmic efficiency to lie relative to these numbers (or, put another way, how many FLOP/s we should expect superintelligent aliens to use, if they were charged with replicating human-level task-performance using FLOPs) seems like a further, difficult investigation (though Dr. Paul Christiano expected the brain to be performing at least some tasks in close to maximally efficient ways, using a substantial portion of its resources – see endnote).665

Overall, I think it more likely than not that 1e15 FLOP/s is enough to perform tasks as well as the human brain (given the right software, which may be very hard to create). And I think it unlikely (<10%) that more than 1e21 FLOP/s is required. That said, as emphasized above:

  • The numbers above are just very loose, back-of-the-envelope estimates.
  • I am not a neuroscientist, and there is no consensus on this topic in neuroscience (or elsewhere).
  • Basically all of my best-guesses are based on a mix of (a) shallow investigation of messy, unsettled science, and (b) a limited, non-representative sampling of expert opinion.

More specific probabilities require answering questions about the theoretical limits of algorithmic efficiency – questions that I haven’t investigated and that I don’t want to overshadow the evidence actually surveyed in the report. In the appendix, I discuss a few narrower conceptions of the brain’s FLOP/s capacity, and offer a few more specific probabilities there, keyed to one particular type of brain model. My current best-guess median for the FLOP/s required to run that particular type of model is around 1e15 (recall that none of these numbers are estimates of the FLOP/s uniquely “equivalent” to the brain).

As can be seen from the figure above, the FLOP/s capacities of current computers (e.g., a V100 at ~1e14 FLOP/s for ~$10,000, the Fugaku supercomputer at ~4e17 FLOP/s for ~$1 billion) cover the estimates I find most plausible.666 However:

  • Task-performance requires resources other than FLOP/s (for example, memory and memory bandwidth).
  • Performing tasks on a particular machine can introduce further overheads and complications.
  • Most importantly, matching the human brain’s task-performance requires actually creating sufficiently capable and computationally efficient AI systems, and this could be extremely (even prohibitively) difficult in practice even with computers that could run such systems in theory. Indeed, as noted above, the FLOP/s required to run a system that does X can be available even while the resources (including data) required to train it remain substantially out of reach. And what sorts of task-performance will result from what sorts of training is itself a further, knotty question.667

So even if my best-guesses are correct, this does not imply that we’ll see AI systems as capable as the human brain anytime soon.

 

6.1 Possible further investigations

Here are a few projects that others interested in this topic might pursue (this list also doubles as a catalogue of some of my central ongoing uncertainties).

Mechanistic method

  • Investigate the literature on population-level modeling and/or neural manifolds, and evaluate what sorts of FLOP/s estimates it might imply.
  • Investigate the best-understood neural circuits (for example, Prof. Eve Marder mentioned some circuits in leeches, C. elegans, flies, and electric fish), and what evidence they provide about the computational models adequate for task-performance.668
  • Follow up on the work in Beniaguev et al. (2020), testing different hypotheses about the size of deep neural networks required to fit neuron behavior with different levels of accuracy.
  • Investigate the computational requirements and biological plausibility of different proposed learning rules in the brain in more depth.
  • Investigate more deeply different possible hypotheses about molecular-level intracellular signaling processes taking place in the brain, and the FLOP/s they might imply.
  • Investigate the FLOP/s implications of non-binary forms of axon signaling in more detail.

Functional method

  • Following up on work by e.g. Batty et al. (2017) and Maheswaranathan et al. (2019), try to gather more data about the minimal artificial neural network models adequate to predict retinal spike trains across trials at different degrees of accuracy (including higher degrees of accuracy than these models currently achieve).
  • Create a version of Moravec’s retina estimate that covers a wider range of computations that the retina performs, but which still focuses on high-level tasks rather than spike trains.
  • Investigate the literature on comparisons between the feature detection in DNNs and in the visual cortex, and try to generate better quantitative estimates of the overlap and the functional method FLOP/s it would imply.
  • Based on existing image classification results, try to extrapolate to the model size required to achieve human-level robustness to adversarial examples and/or generalization across image classification data sets.
  • Investigate various other types of possible functional methods (for example, estimates based on ML systems performing speech recognition).

Limit method

  • Investigate and evaluate more fleshed-out versions of algorithmic arguments.
  • Look for and evaluate examples in biology where the limit method might give the wrong answer: e.g., where a biological system is performing some sort of useful computation that would require more than a FLOP to replicate, but which dissipates less than 0.69kT.

Communication method

  • Estimate the communication bandwidth available in the brain at different distances.
  • Investigate the trade-offs and constraints governing the relationship between communication and computation in human-engineered computers across different tasks, and evaluate the extent to which these would generalize to the brain.

General

  • Gather more standardized, representative data about expert opinion on this topic.
  • Investigate what evidence work on brain-computer interfaces might provide.
  • Investigate and evaluate different methods of estimating the memory and/or number of parameters in the brain – especially ones that go beyond just counting synapses. What would e.g., neural manifolds, different models of state retention in neurons, models of biological neurons as multi-layer neural networks, dynamical models of synapses, etc., imply about memory/parameters?
  • (Ambitious) Simulate a simple organism like C. elegans at a level of detail adequate to replicate behavioral responses and internal circuit dynamics across a wide range of contexts, then see how much the simulation can be simplified.

 

7 Appendix: Concepts of brain FLOP/s

It is reasonably common for people to talk about the brain’s computation/task-performance in terms of metrics like FLOP/s. It is much less common for them to say what they mean.

When I first started this project, I thought that there might be some sort of clear and consensus way of understanding this kind of talk that I just hadn’t been exposed to. I now think this much less likely. Rather, I think that there are a variety of importantly different concepts in this vicinity, each implying different types of conceptual ambiguity, empirical uncertainty, and relevant evidence. These concepts are sufficiently inter-related that it can be easy to slip back and forth between them, or to treat them as equivalent. But if offering estimates, or making arguments about e.g. AI timelines using such estimates, it matters which you have in mind.

I’ll group these concepts into four categories:

  1. FLOP/s required for task-performance, with no further constraints.
  2. FLOP/s required for task-performance + brain-like-ness constraints (e.g., constraints on the similarity between the task-functional model and the brain’s internal dynamics).
  3. FLOP/s required for task-performance + findability constraints (e.g., constraints on what sorts of processes would be able to create/identify the task-functional model in question).
  4. Other analogies with human-engineered computers.

I find it useful, in thinking about these concepts, to keep the following questions in mind:

  • Single answer: Does this concept identify a single, well-defined number of FLOP/s?
  • Non-arbitrariness: Does it involve a highly arbitrary point of focus?
  • One-FLOP-per-FLOP: To the extent that this concept purports to represent the brain’s FLOP/s capacity, does an analogous concept, applied to a human-engineered computer, identify the number of FLOP/s that computer actually performs? E.g., applied to a V100, does it pick out 1e14 FLOP/s?669
  • Relationship to the literature: To what extent do estimates offered in the literature on this topic (mechanistic method, functional method, etc.) bear on the FLOP/s this concept refers to?
  • Relevance to AI timelines: How relevant is this number of FLOP/s to when we should expect humans to develop AI systems that match human-level performance?

This appendix briefly discusses some of the pros and cons of these concepts in light of such questions, and it offers some probabilities keyed to one in particular.

7.1 No constraints

This report has focused on the evidence the brain provides about the FLOP/s sufficient for task-performance, with no further constraints on the models/algorithms employed in performing the tasks. I chose this point of focus centrally because:

  • Its breadth makes room for a wide variety of brain-related sources of evidence to be relevant.
  • It avoids the disadvantages and controversies implied by further constraints (see below).
  • It makes the discussion in the report more likely to be helpful to people with different assumptions and reasons for interest in the topic.

However, it has two main disadvantages:

  • As noted in the report, evidence that X FLOP/s is sufficient is only indirect evidence about the minimum FLOP/s required; and the overall probability that X is sufficient depends, not just on evidence from the brain/current AI systems, but on further questions about where the theoretical limits of algorithmic efficiency are likely to lie. That said, as noted earlier, Dr. Paul Christiano expected there to be at least some tasks such that (a) the brain’s methods of performing them are close to maximally efficient, and (b) these methods use most of the brain’s resources.670 I haven’t investigated this, but if true, it would reduce the force of this disadvantage.
  • The relevance of in principle FLOP/s requirements to AI timelines is fairly indirect. If you know that Y type of task-performance is impossible without X FLOP/s, then you know that you won’t see Y until X FLOP/s are available. But once X FLOP/s are available (as I think they probably are), the question of when you’ll see Y is still wide open. You know that superintelligent aliens could do it in theory, if forced to use only the FLOP/s your computers make available. But on its own, this gives you very little indication of when humans will do it in practice.

In light of these disadvantages, let’s consider a few narrower points of focus.

7.2 Brain-like-ness

One option is to require that models/algorithms employed in matching the brain’s task-performance exhibit some kind of resemblance to its internal dynamics as well. Call such requirements “brain-like-ness constraints.”

Such constraints restrict the set of task-functional models under consideration, and hence, to some extent, the relevance of questions about the theoretical limits of algorithmic efficiency. And they may suggest a certain type of “findability,” without building it into the definition of the models/algorithms under consideration. The brain, after all, is the product of evolution – a search and selection process whose power may be amenable to informative comparison with what we should expect the human research community to achieve.

But brain-like-ness constraints also have disadvantages. Notably:

  • From the perspective of AI timelines, it doesn’t matter whether the AI systems in question are brain-like.
  • Functional method estimates are based on human-engineered systems that aren’t designed to meet any particular brain-like-ness constraints.
  • It’s difficult to define brain-like-ness constraints in a manner that picks out a single, privileged number of FLOP/s, without making seemingly arbitrary choices about the type of brain-like-ness in question and/or losing the One-FLOP-per-FLOP criterion above.

This last problem seems especially salient to me. Here are some examples where it comes up.

Brain simulations

Consider the question: what’s the minimum number of FLOP/s sufficient to simulate the brain? At a minimum, it depends on what you want the simulation to do (e.g., serve as a model for drug development? teach us how the brain works? perform a given type of task?). But even if we focus on replicating task-performance, there still isn’t a single answer, because we have not specified the level of brain-like-ness required to count as a simulation of the brain, assuming task-performance stays fixed.671 Simulating individual molecules is presumably not required. Is replicating the division of work between hemispheres, but doing everything within the hemispheres in a maximally efficient but completely non-brain-like way, sufficient?672 If so, we bring back many of the questions about the theoretical limits of algorithmic efficiency we were aiming to avoid. If not, where’s the line in between? We haven’t said.

“Reasonably brain-like” models

A similar problem arises if we employ a vaguer standard – requiring, for example, that the algorithm in question be “reasonably brain-like.” What counts? Are birds reasonably plane-like? Are the units of a DNN reasonably neuron-like? Some vagueness is inevitable, but this is, perhaps, too much.

Just picking a constraint

One way to avoid this would be to just pick a precisely-specified type of brain-like-ness to require. For example, we might require that the simulation feature neuron-like units (defined with suitable precision), a brain-like connectome, communication via binary spikes, and brain-like average firing rates, but not e.g. individual ion channels, protein dynamics, membrane potential fluctuations, etc. But why these and not others? Absent a principled answer, the choice seems arbitrary.

The brain’s algorithm

Perhaps we might appeal to the FLOP/s required to reimplement what I will call “the brain’s algorithm.” The idea, here, would be to assume that there is a single, privileged description of how the brain performs the tasks that it performs – a description that allows us to pick out a single, privileged number of FLOP/s required to perform those tasks in that way.

We can imagine appealing, here, to influential work by David Marr, who distinguished between three different levels of understanding applicable to an information-processing system:

  1. The computational level: the overall task that the system in question is trying to solve, together with the reason it is trying to solve this task.
  2. The algorithmic level: how the task-relevant inputs and outputs are represented in the system, together with the intermediate steps of the input-output transformation.
  3. The implementation level: how these representations and this algorithm are physically implemented.673

The report focused on level 1. But suppose we ask, instead: how many FLOP/s are required to replicate level 2? Again, the same problem arises: which departures from brain-like-ness are compatible with reimplementing the brain’s algorithm, and which are not (assuming high-level task performance remains unaffected regardless)? I have yet to hear a criterion that seems to me an adequate answer.674

Note that this problem arises even if we assume clean separations between implementation and algorithmic levels in the brain – a substantive assumption, and one that may be more applicable in the context of human-engineered computers than biological systems.675 For even in human-engineered computers, there are multiple algorithmic levels. Consider someone playing Donkey Kong on an MOS 6502. How many FLOP/s do you need to reimplement the “algorithmic level” of the MOS 6502, or to play Donkey Kong “the way the MOS 6502 does it”? I don’t think there’s a single answer. Do we need to emulate individual transistors, or are logic gates enough? Can we implement the adders, or the ALU, or the high-level architecture, in a different way? A full description of how the system performs the task involves all these levels of abstraction simultaneously. Given a description of an algorithm (e.g., a set of states and rules for transitioning between them), we can talk about the operations required to implement it.676 But given an actual physical system operating on multiple levels of abstraction, it’s much less clear what talk about the algorithm it’s implementing refers to.677

 

Figure 21: Levels of abstraction in a microprocessor. From Jonas and Kording (2016), p. 5, Figure 1, unaltered, licensed under CC BY 4.0. Original caption: “A microprocessor is understood at all levels. (A) The instruction fetcher obtains the next instruction from memory. This then gets converted into electrical signals by the instruction decoder, and these signals enable and disable various internal parts of the processor, such as registers and the arithmetic logic unit (ALU). The ALU performs mathematical operations such as addition and subtraction. The results of these computations can then be written back to the registers or memory. (B) Within the ALU there are well-known circuits, such as this one-bit adder, which sums two one-bit signals and computes the result and a carry signal. (C) Each logic gate in (B) has a known truth table and is implemented by a small number of transistors. (D) A single NAND gate is comprised of transistors, each transistor having three terminals (E). We know (F) the precise silicon layout of each transistor.”
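
To make the point about multiple levels concrete, here is a toy sketch (my own illustration, not drawn from Jonas and Kording (2016)): a one-bit full adder, like the one in panel (B) above, implemented once gate-by-gate out of NAND primitives and once directly as arithmetic. The two have identical input-output behavior, but a budget that insists on reproducing the gate level counts many more basic operations than one that reproduces only the arithmetic level.

```python
# A toy illustration of "multiple algorithmic levels" (not from the report):
# the same one-bit full adder, described at two levels of abstraction.

def nand(a: int, b: int) -> int:
    """NAND gate: the low-level primitive in Figure 21, panel (D)."""
    return 1 - (a & b)

def full_adder_gate_level(a: int, b: int, carry_in: int):
    """Full adder built only from NAND gates (9 gates total)."""
    # XOR(a, b) from four NANDs
    t1 = nand(a, b)
    sum_ab = nand(nand(a, t1), nand(b, t1))
    # XOR(sum_ab, carry_in) from four more NANDs
    t2 = nand(sum_ab, carry_in)
    s = nand(nand(sum_ab, t2), nand(carry_in, t2))
    # carry_out = (a AND b) OR (sum_ab AND carry_in), via one more NAND
    carry_out = nand(t1, t2)
    return s, carry_out

def full_adder_arithmetic_level(a: int, b: int, carry_in: int):
    """The same input-output behavior, one level of abstraction up."""
    total = a + b + carry_in
    return total % 2, total // 2

# Identical behavior on all inputs; very different "operation counts".
for bits in [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]:
    assert full_adder_gate_level(*bits) == full_adder_arithmetic_level(*bits)
```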

 

The lowest algorithmic level

Perhaps we could focus on the lowest algorithmic level, assuming this is well-defined (or, put another way, on replicating all the algorithmic levels, assuming that the lowest structures all the rest)? One problem with this is that even if we knew that a given type of brain simulation – for example, a connectome-like network of Izhikevich spiking neurons – could be made task-functional, we wouldn’t yet know whether it captured the level in question. Are ion channels above or below the lowest algorithmic level? To many brain modelers, these questions don’t matter: if you can leave something out without affecting the behavior you care about, all the better. But focusing on the lowest-possible algorithmic level brings to the fore abstract questions about where this level lies. And it’s not clear, at least to me, how to answer them.678

Another problem with focusing on the lowest algorithmic level is that, to the extent we want a FLOP/s estimate that would be to the brain what 1e14 FLOP/s is to a V100, we’ll do poorly on the One-FLOP-per-FLOP criterion above: e.g., if we assume that the lowest algorithmic level in a V100 is at the level of transistors, we’ll end up budgeting many more FLOP/s for a transistor-level simulation than the 1e14 FLOP/s the V100 actually performs.679
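
For a rough sense of the mismatch, here is an illustrative sketch. The transistor count (~21 billion) and ~1.5 GHz clock are published V100 figures; the assumption of at least one FLOP per transistor per clock tick is my own simplification of what a transistor-level simulation might cost:

```python
# Illustrative: budgeting a "lowest algorithmic level" simulation of a V100.
transistors = 2.1e10            # V100 transistor count (~21 billion)
clock_hz = 1.5e9                # approximate boost clock
flop_per_transistor_step = 1.0  # assume >=1 FLOP to update each transistor per tick

transistor_level_budget = transistors * clock_hz * flop_per_transistor_step
actual_v100_flops = 1e14        # the figure used in the report

print(f"Transistor-level simulation budget: ~{transistor_level_budget:.0e} FLOP/s")
print(f"FLOP/s the V100 actually performs:  ~{actual_v100_flops:.0e}")
# ~3e19 vs ~1e14: the lowest-level budget overshoots by ~5 orders of magnitude.
```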

The highest algorithmic level

What about the highest algorithmic level? As with the lowest algorithmic level, it’s unclear where this highest level lies, and very high-level descriptions of the brain’s dynamics (analogous, e.g., to the “processor architecture” portion of the diagram above) may leave a lot of room for intuitively non-brain-like forms of efficiency (recall the “simulation” of the brain’s hemispheres discussed above). And it’s not clear that this standard passes the “One-FLOP-per-FLOP” test either: if a V100 performing some task is inefficient at some lower level of algorithmic description, then the maximally efficient way of performing that task in a manner that satisfies some higher level of description may use fewer FLOP/s than the V100 performs.

Nothing that doesn’t map to the brain

Nick Beckstead suggests a brain-like-ness constraint on which the algorithm used to match the brain’s task performance must be such that (a) all of its algorithmic states map onto brain states, and (b) the transitions between these algorithmic states mirror the transitions between the corresponding brain states.680 Such a constraint rules out replicating the division of work between hemispheres, but doing everything else in a maximally efficient way, because the maximally efficient way will presumably involve algorithmic states that don’t map onto brain states.

This constraint requires specifying the necessary accuracy of the mapping from algorithmic states to brain states (though note that defining task-performance at all requires something like this).681 I also worry that whether a given algorithm satisfies this constraint or not will end up depending on which operations are treated as basic (and hence immune from the requirement that the state-transitions involved in implementing them map onto the brain’s).682 And it’s not clear to me that this definition will capture One-FLOP-per-FLOP, since it seems to require a very high degree of emulation accuracy. That said, I think something in this vicinity might turn out to work.

More generally, though, brain-like-ness seems only indirectly relevant to what we ultimately care about, which is task-performance itself. Can findability constraints do better?

7.3 Findability

Findability constraints restrict attention to the FLOP/s required to run task-functional systems that could be identified or created via a specific type of process. Examples include task-functional systems that:

  1. humans will in fact create in the future (or, perhaps, the first such systems);
  2. humans would/could create, given access to a specific set of resources and/or data;
  3. would/could be identified via a specific type of training procedure – for example, a procedure akin to those used in machine learning today;
  4. could/would be found via a specified type of evolution-like search process, akin to the one that “found” the biological brain;
  5. could be created by an engineer “as good as evolution” at engineering.683

The central benefit of all such constraints is that they are keyed directly to what it takes to actually create a task-functional system, rather than what systems could exist in principle. This makes them more informative for the purposes of thinking about when such systems might in fact be created by humans.

But it’s also a disadvantage, as estimates involving findability constraints require answering many additional, knotty questions about what types of systems are what kinds of findable (e.g., what sorts of research programs or training methods could result in what sorts of task performance; what types of resources and data these programs/methods would require; what would in fact result from various types of counterfactual “evolution-like” search processes, etc.).

Findability constraints related to evolution-like search processes/engineering efforts (e.g., (4) and (5) above) are also difficult to define precisely, and they are quite alien to mainstream neuroscientific discourse. This makes them difficult to solicit expert opinion about, and harder to evaluate using evidence of the type surveyed in the report.

My favorite of these constraints is probably the FLOP/s that will be used by the first human-built systems to perform these tasks, since this is the most directly relevant to AI timelines. I see functional method estimates as especially relevant here, and mechanistic/limit method estimates as less so.

7.4 Other computer analogies

There are a few other options as well, which appeal to various other analogies with human-engineered computers.

Operations per second

For example, we can imagine asking: how many operations per second does the brain perform? One problem here is that “operations” does not have a generic meaning. An operation is just an input-output relationship, implemented as part of a larger computation, and treated as basic for the purpose of a certain kind of analysis.684 The brain implements many different such relationships at different levels of abstraction: for example, it implements many more “ion-channel opening/closing” operations per second than it does “spikes through synapses” operations.685 Estimates that focus on the latter, then, need to say why they do so. You can’t just pick a thing to count, and count it.
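
To illustrate how much the choice of “operation” matters, here is a toy sketch using rough figures from the report’s mechanistic-method discussion (the synapse count, average firing rate, and 1 ms timestep are all assumptions drawn from the ranges discussed there):

```python
# Illustrative only: how the "operations per second" answer depends on which
# input-output relationship you decide to treat as a basic operation.

n_synapses = 1e14          # report's range: ~1e14-1e15 synapses
avg_firing_rate_hz = 1.0   # report's range: ~0.1-1 Hz average firing rate
timesteps_per_s = 1e3      # 1 ms timesteps, a common modeling resolution

# Level 1: treat "a spike through a synapse" as one operation.
spike_ops_per_s = n_synapses * avg_firing_rate_hz   # ~1e14

# Level 2: treat "one synaptic state update per timestep" as one operation.
timestep_ops_per_s = n_synapses * timesteps_per_s   # ~1e17

print(f"spike-through-synapse ops/s:    ~{spike_ops_per_s:.0e}")
print(f"per-timestep synapse updates/s: ~{timestep_ops_per_s:.0e}")
# Same brain, three orders of magnitude apart: the level of description,
# not the biology alone, fixes the count.
```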

More importantly, our ultimate interest is in systems that run on FLOP/s and perform tasks at human levels. To be relevant to this, then, we also need to know how many FLOP/s are sufficient to replicate one of the operations in question; and we need some reason to think that, so replicated, the resulting overall FLOP/s budget would be enough for task-performance. This amounts to something closely akin to the mechanistic method, and the same questions about the required degree of brain-like-ness apply.

FLOP/s it performs

What if we just asked directly: how many FLOP/s does the brain perform? Again, we need to know what is meant.

  • One possibility is that we have in mind one of the other questions above: e.g., how many FLOP/s do you need to perform some set of tasks that the brain performs, perhaps with some kind of implicit brain-like-ness constraint. This raises the problems discussed in 7.1 and 7.2 above.
  • Another possibility is that we are asking more literally: how many times per second does the brain’s biophysics implement e.g. an addition, subtraction, multiplication, or division operation of a given level of precision? In some places, we may be able to identify such implementation – for example, if synaptic transmission implements an addition operation via the postsynaptic membrane potential. In other places, though, the task-relevant dynamics in the brain may not map directly to basic arithmetic; rather, they may be more complicated, and require multiple FLOPs to capture. If we include these FLOPs (as we should, if we want the question to be relevant to the hardware requirements for advanced AI systems), we’re back to something closely akin to the mechanistic method, and to the same questions about brain-like-ness.

Usefulness limits

I’ll consider one final option, which seems to me (a) promising and (b) somewhat difficult to think about.

Suppose you were confronted with a computer performing various tasks, programmed by a programmer of unclear skill, using operations quite dissimilar from FLOP/s. You want some way of quantifying this computer’s computational capacity in FLOP/s. How would you do it?

As discussed above, using the minimum FLOP/s sufficient to perform any of the tasks the computer is currently programmed to perform seems dicey: this depends on where the theoretical limits of algorithmic efficiency lie, relative to algorithms the computer is running. But suppose we ask, instead, about the minimum FLOP/s sufficient to perform any useful task that the computer could in principle be programmed to perform, given arbitrary programming skill. An arbitrarily skillful programmer, after all, would presumably employ maximally efficient algorithms to use this computer to its fullest capacity.

Applied to a computer actually performing FLOP/s, this approach does well on the “One-FLOP-per-FLOP” criterion. That is, even an arbitrarily skillful programmer still cannot wring more FLOP/s out of a V100 than the computer actually performs, assuming this programmer is restricted to the computational mechanisms intended by the system’s designers. So the minimum FLOP/s sufficient to do any of the tasks that this programmer could use a V100 to perform would presumably be 1e14.

And it also fits well with what we’re intuitively doing when we ask about a system’s computational capacity: that is, we’re asking how useful this system can be for computational tasks. For instance, if a task requires 1e17 FLOP/s, can I do it with this machine? This approach gives the answers you would get if the machine actually performed FLOP/s itself.

Can we apply this approach to the brain? The main conceptual challenge, I think, is defining what sorts of interventions would count as “programming” the brain.686

  • One option would be to restrict attention to external stimulation (e.g., talking, reading). The tasks in question would be the set of tasks that any human could in principle be trained to perform, given arbitrary training time/arbitrarily skilled trainers. This would be limited by the brain’s existing methods of learning.
  • Another option would be to allow direct intervention on biophysical variables in the brain. Here, the main problem would be putting limits on which variables can be intervened on, and by how much. Intuitively, we want to disallow completely remoulding the brain into a fundamentally different device, or “use” of mechanisms and variables that the brain does not currently “use” to store or process information. I think it possible that this sort of restriction can be formulated with reasonable precision, but I haven’t tried.

One might also object that this approach will focus attention on tasks that are overall much more difficult than the ones that we generally have in mind when we’re thinking about human-level task performance.687 I think that this is very likely true, but it seems quite compatible with using this approach as a concept of the brain’s FLOP/s capacity, as it seems fine (indeed, intuitive) if this concept indicates the limitations on the brain’s task performance imposed by hardware constraints alone, as opposed to other ways the system is sub-optimal.

7.5 Summing up

Here is a summary of the various concepts I’ve discussed:

 

Concept: Minimum FLOP/s sufficient to match the brain’s task-performance.

Advantages: Simple; broad; focuses directly on task-performance.

Disadvantages: Existing brains and AI systems provide only indirect evidence about the theoretical limits of algorithmic efficiency; questionably relevant to the FLOP/s we should expect human engineers to actually use.

Concept: Minimum FLOP/s sufficient to run a task-functional model that meets some brain-like-ness constraint, such as being a:

  • “simulation of the brain”
  • “reasonably brain-like model”
  • model with some precisely specified type of brain-like-ness
  • model that captures “the algorithmic level”
  • … “the lowest algorithmic level”
  • … “the highest algorithmic level”
  • model with no states/transitions that don’t map to the brain

Advantages: Restricted space of models makes theoretical limits of algorithmic efficiency somewhat less relevant, and neuroscientific evidence more directly relevant; connection to evolution may indicate a type of findability (without needing to include such findability in the definition).

Disadvantages: Non-arbitrary brain-like-ness constraints are difficult to define with precision adequate to pick out a single number of FLOP/s; the systems we ultimately care about don’t need to be any particular degree of brain-like; functional method estimates are not based on systems designed to be brain-like; analogous standards, applied to a human-engineered computer, struggle to identify the FLOP/s that computer actually performs; the connection between evolutionary findability and specific computational models of the brain is often unclear.

Concept: Minimum FLOP/s sufficient to run a task-functional model that meets some findability constraint, such as being:

  • the first such model humans will in fact create
  • creatable by humans using X-type of training/resources/data etc.
  • findable by X-type of hypothetical, evolution-like process
  • creatable by an engineer “as good as evolution” at engineering

Advantages: More directly relevant to the FLOP/s costs of models that we might expect humans to create, as opposed to ones that could exist in principle. “First model humans will in fact create” seems especially relevant (and functional method estimates may provide some purchase on it).

Disadvantages: Implicates difficult further questions about which models are what kinds of findable; findability constraints based on evolutionary hypotheticals/evolution-level engineers are also difficult to define precisely, and they are fairly alien to mainstream neuroscientific discourse – a fact which makes them difficult to solicit expert opinion about and/or evaluate using evidence of the type surveyed in the report.

Concept: Other computer analogies:

  • “Operations per second in the brain”
  • “FLOP/s the brain performs”
  • “Minimum FLOP/s sufficient to perform any task the brain could be programmed to perform”

Advantages: Variable. Focusing on the tasks that the brain can be “programmed” to perform does fairly well on One-FLOP-per-FLOP, and it fits well with what we might want a notion of “FLOP/s capacity” to do, while also side-stepping questions about the degree of algorithmic inefficiency in the brain.

Disadvantages: In order to retain relevance to task-functional systems running on FLOP/s, “operations per second in the brain” and “FLOP/s the brain performs” seem to me to collapse back into something like the mechanistic method, and into correspondingly difficult questions about the theoretical limits of algorithmic efficiency and/or brain-like-ness. Focusing on the tasks that the brain can be programmed to perform requires defining what interventions count as “programming” as opposed to reshaping – e.g., distinguishing between hardware and software, which is hard in the brain.
Figure 22: Concepts of “brain FLOP/s”

All these options have pros and cons. I don’t find any of them particularly satisfying, or obviously privileged as a way of thinking about the FLOP/s “equivalent” to the human brain. I’ve tried, in the body of the report, to use a broad framing; to avoid getting too bogged down in conceptual issues; and to survey evidence relevant to many narrower points of focus.

That said, it may be useful to offer some specific (though loose) probabilities for at least one of these. The point of focus I feel most familiar with is the FLOP/s required to run a task-functional model that satisfies a certain type of (somewhat arbitrary and ill-specified) brain-like-ness constraint, so I’ll offer some probabilities for that, keyed to the different mechanistic method ranges discussed above.

Best-guess probabilities for the minimum FLOP/s sufficient to run a task-functional model that satisfies the following conditions:

  1. It includes units and connections between units corresponding to each neuron and synapse in the human brain (these units can have further internal structure, and the model can include other things as well).688
  2. The functional role of these units and connections in task-performance is roughly similar to the functional role of the corresponding neurons and synapses in the brain.689

Caveats:

  • These are rough subjective probabilities offered about unsettled science. Hold them lightly.690
  • (2) is admittedly imprecise. My hope is that these numbers can be a helpful supplement to the more specific evidence surveyed in the report, but those who think the question ill-posed are free to ignore them.691
  • This is not an estimate of the “FLOP/s equivalent to the brain.” It’s an estimate of “the FLOP/s required to run a specific type of model of the brain.” See Sections 7.1–7.4 on why I think the concept of “the FLOP/s equivalent to the brain” is underspecified.
  • I also think it very plausible that modeling every neuron/synapse is in some sense overkill (see Section 2.4.2 above), even in the context of various types of brain-like-ness constraints; and even more so without them.
  • I assume access to “sparse FLOP/s,” as discussed in Section 2.1.1.2.2.
For each FLOP/s range below, I give my best-guess probability, followed by the central considerations I have in mind.
<1e13 FLOP/s (best-guess probability: ~15%)

This is less than the estimate I’ve used for the spikes through synapses per second in the brain, so this range requires either that (a) this estimate is too high, or (b) satisfying the conditions above requires less than 1 FLOP per spike through synapse. (a) seems possible, as these parameters seem fairly unknown and I wouldn’t be that surprised if e.g. the average firing rate was <0.1 Hz, especially given the estimates in Lennie (2003). And (b) seems quite possible as well: a single FLOP might cover multiple spikes (for example, if what matters is a firing rate encoded in multiple spikes), and in general, it might well be possible to simplify what matters about the interactions between neurons in ways that aren’t salient to me (though note that simplifications that summarize groups of neurons are ruled out by the definition of the models in question).

This sort of range also requires <100 FLOP/s per neuron for firing decisions, which, assuming at least 1 FLOP per firing decision, means you have to be computing firing decisions less than 100 times per second. My naive guess would’ve been that you need to do it more frequently, if a neuron is operating on e.g. 1 ms timescales, but I don’t have a great sense of the constraints here, and Sarpeshkar (2010) and Dr. Paul Christiano both seemed to think it possible to compute firing decisions less often than once per timestep (see Section 2.1.2.5).

And finally, this sort of range requires that the FLOP/s required to capture the contributions of all the other processes described in the mechanistic method section (e.g., dendritic computation, learning, alternative signaling mechanisms, etc.) are <1 FLOP per spike through synapse and <100 FLOP/s per neuron. Learning seems to me like the strongest contender for requiring more than this, but maybe it’s in the noise due to slower timescales, and/or only a small factor (e.g., 2× for something akin to gradient descent methods) on top of a very low-end baseline.

So overall, it doesn’t seem like this range is ruled out, even assuming that we’re modeling individual neurons and synapses. But it requires that the FLOPs costs of everything be on the low side. And my very vague impression is that many experts (even those sympathetic to the adequacy of comparatively simple models) would think this range too low. That said, it also covers possible levels of simplification that current theories/models do not countenance. And it seems generally reasonable, in contexts with this level of uncertainty, to keep error bars (in both directions) wide.

1e13-1e15 FLOP/s (best-guess probability: ~30%)

This is the range that emerges from the most common type of methodology in the literature, which budgets one operation per spike through synapse, and seems to assume that (i) operations like firing decisions, which scale with the number of neurons (~1e11) rather than the number of synapses (~1e14-1e15), are in the noise, and (ii) so is everything else (including learning, alternative signaling mechanisms, and so on).

As I discuss in Section 2.1.2.5, I think that assumption (i) is less solid if we budget FLOPs at synapses based on spike rates rather than timesteps, since the FLOPs costs of processes in a neuron could scale with timesteps per neuron per second, and timesteps are plausibly a few orders of magnitude more frequent than spikes, on average. Still, this range covers all neuron models with FLOP/s costs less than an Izhikevich spiking neuron model run with 1 ms timesteps (~1e15 FLOP/s for 1e11 neurons) – a set that includes many models in the integrate-and-fire family (run at similar temporal resolutions). So it still seems like a decent default budget for fairly simple models of neuron/synapse dynamics.

Dendritic computation and learning seem like prominent processes missing from such a basic model, so this range requires that these don’t push us beyond 1e15 FLOP/s. If we would end up on the low end of this range (or below) absent those processes, this would leave at least one or two orders of magnitude for them to add, which seems like a reasonable amount of cushion to me, given the considerations surveyed in Sections 2.1.2.2 and 2.2. That said, my best guess would be that we need at least a few FLOPs per spike through synapse to cover short-term synaptic plasticity, so there would need to be less than ~3e14 spikes through synapses per second to leave room for this. And the most basic type of integrate-and-fire neuron model already puts us at ~5e14 FLOP/s (assuming 1 ms timesteps), so this doesn’t leave much room for increases from dendritic computation.692

Overall, this range represents a simple default model that seems fairly plausible to me, despite not budgeting explicitly for these other complexities; and various experts appear to find this type of simple default persuasive.693

1e15-1e17 FLOP/s (best-guess probability: ~30%)

This range is similar to the last, but with an extra factor of 100x budgeted to cover various possible complexities that came up in my research. Specifically, assuming the number of spikes through synapses per second falls in the range I’ve used (1e13-1e15), it covers 100-10,000 FLOPs per spike through synapse (this would cover Sarpeshkar’s (2010) 50 FLOPs per spike through synapse for synaptic filtering and learning, along with various models of learning discussed in Section 2.2.2), as well as 1e4-1e6 FLOP/s per neuron (this would cover, on the top end, single-compartment Hodgkin-Huxley models run with 0.1 ms timesteps – a level of modeling detail/complexity that I expect many computational neuroscientists to consider unnecessary).

Overall, this range seems very plausibly adequate to me, and various experts I engaged with seemed to agree.694 I’m much less confident that it’s required, but as mentioned above, my best guess is that you need at least a few FLOPs per spike through synapse to cover short-term synaptic plasticity, and plausibly more for more complex forms of learning; and it seems plausible to me that ultimately, FLOPs budgets for firing decisions (including dendritic computation) are somewhere between Izhikevich spiking neurons and Hodgkin-Huxley models. But as discussed above, lower ranges seem plausible as well.

1e17-1e21 FLOP/s (best-guess probability: ~20%)

As I noted in the report, I don’t see a lot of strong positive evidence that budgets this high are required. The most salient considerations for me are (a) the large FLOP/s costs of various DNN models of neuron behavior discussed in the report, which could indicate types of complexity that lower budgets do not countenance, and (b) if you budget at least one FLOP per timestep per synapse (as opposed to per spike through synapse), along with <1 ms timesteps and >1e14 synapses, then you get above 1e17 FLOP/s, and it seems possible that sufficiently important and unsimplifiable changes are taking place at synapses this frequently (for example, changes involved in learning). Some experts also seem to treat “time-steps per second per variable” as a default method of generating FLOP/s estimates (and there may be many variables per synapse – see e.g. Benna and Fusi (2016)).

Beyond this, the other central pushes I feel in this direction involve (a) the general costliness of low-level modeling of biological and chemical processes; (b) the possibility that learning and dendritic computation introduce more complexity than 1e17 FLOP/s budgets for; (c) the fact that this range covers four orders of magnitude; (d) the possibility of some other type of unknown error or mistake, not currently on my radar, that pushes required FLOP/s into this range; and (e) an expectation that a decent number of experts would give estimates in this range as well.

>1e21 FLOP/s (best-guess probability: ~5%)

Numbers this high start to push past the upper bounds discussed in the limit method section. These bounds don’t seem airtight to me, but I feel reasonably persuaded by the hardware arguments discussed in Section 4.2.2 (e.g., I expect the brain to be dissipating at least a few kT per FLOP required to meet the conditions above, and to use at least 1 ATP, of which it has a maximum of ~1e20/s available). I also don’t see a lot of positive reason to go this high (though the DNN models I mentioned are one exception to this); other methods generally point to lower numbers; and some experts I spoke to were very confident that numbers in this range are substantial overkill. That said, I also put macroscopic probability on the possibility that these experts and arguments (possibly together with the broader paradigms they assume) are misguided in some way; that the conditions above, rightly understood, somehow end up requiring very large FLOP/s budgets (though this last one feels more like uncertainty about the concepts at stake in the question than uncertainty about the answer); and/or that the task-relevant causal structure in the brain is just intrinsically very difficult to replicate using FLOP/s (possibly because it draws on analog physical primitives, continuous/very fine-grained temporal dynamics, and/or complex biochemical interactions that are cheap for the brain, but very expensive to capture with FLOP/s). And in general, long tails seem appropriate in contexts with this level of uncertainty.
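
For readers who want the arithmetic behind these ranges in one place, here is a minimal consolidation (my own; the neuron/synapse counts are the report’s rough figures, and the per-model FLOP counts follow the Izhikevich (2004)-style figures cited in the report):

```python
# A minimal consolidation of the arithmetic behind the ranges above.
# All figures are the report's rough estimates, not new data.

n_neurons = 1e11
n_synapses = 1e14                # report's range: 1e14-1e15
spikes_through_synapses = 1e14   # report's range: 1e13-1e15 per second

# 1e13-1e15 range: ~1 FLOP per spike through synapse, everything else in the noise.
simple_budget = spikes_through_synapses * 1.0

# Izhikevich spiking neurons: ~13 FLOPs per 1 ms timestep per neuron.
izhikevich_budget = n_neurons * 13 * 1e3          # ~1.3e15 FLOP/s

# 1e15-1e17 range, top end: single-compartment Hodgkin-Huxley at
# ~120 FLOPs per 0.1 ms timestep per neuron (~1e6 FLOP/s per neuron).
hh_budget = n_neurons * 120 * 1e4                 # ~1.2e17 FLOP/s

# 1e17-1e21 range: >=1 FLOP per synapse per timestep; at 1 ms timesteps and
# 1e14 synapses this alone reaches 1e17 FLOP/s (finer timesteps push higher).
per_timestep_synapse_budget = n_synapses * 1e3

for name, flops in [("1 FLOP per spike through synapse", simple_budget),
                    ("Izhikevich neurons, 1 ms steps", izhikevich_budget),
                    ("Hodgkin-Huxley, 0.1 ms steps", hh_budget),
                    ("1 FLOP per synapse per 1 ms step", per_timestep_synapse_budget)]:
    print(f"{name}: ~{flops:.1e} FLOP/s")
```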

8 Sources

DOCUMENT SOURCE
Aaronson (2011) Source
Abraham and Philpot (2009) Source
Achard and De Schutter (2006) Source
Adam (2019) Source
Adams (2013) Source
Agarwal et al. (2017) Source
AI Impacts, “Brain performance in FLOPS” Source
AI Impacts, “Brain performance in TEPS” Source
AI Impacts, “Glial Signaling” Source
AI Impacts, “Neuron firing rates in humans” Source
AI Impacts, “Scale of the Human Brain” Source
AI Impacts, “The cost of TEPS” Source
AI Impacts, “How AI timelines are estimated” Source
Aiello (1997) Source
Aiello and Wheeler (1995) Source
Ajay and Bhalla (2006) Source
Alger (2002) Source
Amodei and Hernandez (2018) Source
Amodei et al. (2016) Source
Ananthanarayanan et al. (2009) Source
Anastassiou and Koch (2015) Source
Anastassiou et al. (2011) Source
Andrade-Moraes et al. (2013) Source
Angel et al. (2012) Source
Antolík et al. (2016) Source
Araque and Navarrete (2010) Source
Araque et al. (2000) Source
Araque et al. (2001) Source
Arizona Power Authority, “History of Hoover” Source
Arkhipov et al. (2018) Source
Asadi and Navi (2007) Source
Aschoff et al. (1971) Source
Ashida et al. (2007) Source
Astrup et al. (1981a) Source
Attwell and Laughlin (2001) Source
Azevedo et al. (2009) Source
Backyard Brains, “Experiment: Comparing Speeds of Two Nerve Fiber Sizes” Source
Balasubramanian and Berry (2002) Source
Balasubramanian et al. (2001) Source
Baldwin and Eroglu (2017) Source
Banino et al. (2018) Source
Barbu et al. (2019) Source
Barth and Poulet (2012) Source
Bartheld et al. (2016) Source
Bartol et al. (2015) Source
Bartol Jr et al. (2015) Source
Bartunov et al. (2018) Source
Bashivan et al. (2019) Source
Batty et al. (2017) Source
Bell (1999) Source
Bengio et al. (2015) Source
Beniaguev et al. (2019) Source
Beniaguev et al. (2020) Source
Benna and Fusi (2016) Source
Bennett (1973) Source
Bennett (1981) Source
Bennett (1989) Source
Bennett (2003) Source
Bennett and Zukin (2004) Source
Bennett et al. (1991) Source
Bernardinell et al. (2004) Source
Berry et al. (1999) Source
Bezzi et al. (2004) Source
Bhalla (2004) Source
Bhalla (2014) Source
Bi and Poo (2001) Source
Bialowas et al. (2015) Source
Biederman (1987) Source
Bileh et al. (2020) Source
Bindocci et al. (2017) Source
Bischofberger et al. (2002) Source
Blanding (2017) Source
Blinkow and Glezer (1968) Source
Bliss and Lømo (1973) Source
Bollmann et al. (2000) Source
Bomash et al. (2013) Source
Bostrom (1998) Source
Bouhours et al. (2011) Source
Bower and Beeman (1995) Source
Brain-Score Source
Brain-Score, “Leaderboard” Source
Brains in Silicon, “Publications” Source
Braitenberg and Schüz (1998) Source
Branco, Clark, and Häusser (2010) Source
Brette (2015) Source
Brette and Gerstner (2005) Source
Brody and Yue (2000) Source
Brown et al. (2020) Source
Brownlee (2019a) Source
Brownlee (2019b) Source
Bruzzone et al. (1996) Source
Bub (2002) Source
Bucurenciu et al. (2008) Source
Bullock et al. (1990) Source
Bullock et al. (1994) Source
Bullock et al. (2005) Source
Burgoyne and Morgan (2003) Source
Burke (2000) Source
Burkitt (2006) Source
Burr et al. (1994) Source
Burrows (1996) Source
Bush et al. (2015) Source
Bushong et al. (2002) Source
Bussler (2020) Source
Büssow (1980) Source
Butt et al. (2004) Source
Button et al. (2013) Source
Buzsáki and Mizuseki (2014) Source
Cadena et al. (2017) Source
Cadena et al. (2019) Source
Cantero et al. (2018) Source
Carandini (2012) Source
Carandini et al. (2005) Source
Cariani (2011) Source
Carp (2012) Source
Carr and Boudreau (1993b) Source
Carr and Konishi (1990) Source
Castet and Masson (2000) Source
Cell Biology By The Numbers, “How much energy is released in ATP hydrolysis?” Source
Cerebras, “Cerebras Wafer Scale Engine: An Introduction” Source
Chaigneau et al. (2003) Source
Chang (2019) Source
Cheng et al. (2018) Source
Cheramy (1981) Source
Chiang et al. (2019) Source
Chong et al. (2016) Source
Christie and Jahr (2009) Source
Christie et al. (2011) Source
Citri and Malenka (2008) Source
Clark (2020) Source
Clopath (2012) Source
Cochran et al. (1984) Source
Collel and Fauquet (2015) Source
Collins et al. (2016) Source
Compute Canada, “Technical Glossary” Source
Cooke and Bear (2014) Source
Cooke et al. (2015) Source
Crick (1984) Source
Crick (1989) Source
Critch (2016) Source
Cudmore and Desai (2008) Source
Cueva and Wei (2018) Source
Dalrymple (2011) Source
Daniel et al. (2013) Source
Dayan and Abbott (2001) Source
De Castro (2013) Source
de Faria, Jr. et al. (2019) Source
Deans et al. (2007) Source
Debanne et al. (2013) Source
Deli et al. (2017) Source
Deneve et al. (2001) Source
Dermietzel et al. (1989) Source
Dettmers (2015) Source
Di Castro et al. (2011) Source
Diamond (1996) Source
Dix (2005) Source
Dongarra et al. (2003) Source
Doose et al. (2016) Source
Doron et al. (2017) Source
Dowling (2007) Source
Drescher (2006) Source
Drexler (2019) Source
Dreyfus (1972) Source
Dugladze et al. (2012) Source
Dunn et al. (2005) Source
Earman and Norton (1998) Source
Einevoll et al. (2015) Source
Eliasmith (2013) Source
Eliasmith et al. (2012) Source
Elliott (2011) Source
Elsayed et al. (2018) Source
Engl and Attwell (2015) Source
Enoki et al. (2009) Source
Erdem and Hasselmo (2012) Source
Fain et al. (2001) Source
Faisal (2012) Source
Faisal et al. (2008) Source
Faria et al. (2019) Source
Fathom Computing Source
Fedchyshyn and Wang (2005) Source
Feynman (1996) Source
Fiete et al. (2008) Source
Fischer et al. (2008) Source
Fisher (2015) Source
Fortune and Rose (2001) Source
Fotowat (2010) Source
Fotowat and Gabbiani (2011) Source
Francis et al. (2003) Source
Frank (2018) Source
Frank and Ammer (2001) Source
Frankle and Carbin (2018) Source
Fredkin and Toffoli (1982) Source
Freitas (1996) Source
Friston (2010) Source
Fröhlich and McCormick (2010) Source
Fuhrmann et al. (2001) Source
Funabiki et al. (1998) Source
Funabiki et al. (2011) Source
Funke et al. (2020) Source
Fusi and Abbott (2007) Source
Future of Life, “Steven Pinker and Stuart Russell on the Foundations, Benefits, and Possible Existential Threat of AI” Source
Gütig and Sompolinsky (2006) Source
Gabbiani et al. (2002) Source
Gallant et al. (1993) Source
Gallant et al. (1996) Source
Gallego et al. (2017) Source
Gardner‐Medwin (1983) Source
Garg (2015) Source
Garis et al. (2010) Source
Gatys et al. (2015) Source
Geiger and Jonas (2000) Source
Geirhos et al. (2018) Source
Geirhos et al. (2020) Source
Gelal et al. (2016) Source
Georgopoulos et al. (1986) Source
Gerstner and Naud (2009) Source
Gerstner et al. (2018) Source
Get Body Smart, “Visual Cortex Areas” Source
Ghanbari et al. (2017) Source
Giaume (2010) Source
Giaume et al. (2010) Source
Gidon et al. (2020) Source
Gilbert (2013) Source
GitHub, “convnet-burden” Source
GitHub, “neuron_as_deep_net” Source
GitHub, “Report for resnet-101” Source
GitHub, “Report for SE-ResNet-152” Source
Gittis et al. (2010) Source
Goldman et al. (2001) Source
Gollisch and Meister (2008) Source
Gollisch and Meister (2010) Source
Goodenough et al. (1996) Source
Google Cloud, “Tensor Processing Unit” Source
Grace et al. (2018) Source
Graph 500 Source
Graubard et al. (1980) Source
Green and Swets (1966) Source
Greenberg and Ziff (1984) Source
Greenberg et al. (1985) Source
Greenberg et al. (1986) Source
Greenemeier (2009) Source
Greydanus (2017) Source
Gross (2008) Source
Grossberg (1987) Source
Grutzendler et al. (2002) Source
Guerguiev et al. (2017) Source
Guo et al. (2014) Source
Guthrie et al. (1999) Source
Hänninen and Takala (2010) Source
Hänninen et al. (2011) Source
Hafting et al. (2005) Source
Halassa et al. (2007b) Source
Halassa et al. (2009) Source
Hamilton (2015) Source
Hamzelou (2020) Source
Hansel et al. (1998) Source
Hanson (2011) Source
Hanson (2016) Source
Hanson et al. (2019) Source
Harris (2008) Source
Harris and Attwell (2012) Source
Hasenstaub et al. (2010) Source
Hassabis et al. (2017) Source
Haug (1986) Source
Hay et al. (2011) Source
Hayworth (2019) Source
He et al. (2002) Source
Héja et al. (2009) Source
Hemmo and Shenker (2019) Source
Hendricks et al. (2020) Source
Henneberger et al. (2010) Source
Herculano-Houzel (2009) Source
Herculano-Houzel and Lent (2005) Source
Herz et al. (2006) Source
Hess et al. (2000) Source
Hines and Carnevale (1997) Source
Hinton (2011) Source
Hinton et al. (2006) Source
Hochberg (2012) Source
Hoffmann and Pfeifer (2012) Source
Hollemans (2018) Source
Holtmaat et al. (2005) Source
Hood (1998) Source
Hoppensteadt and Izhikevich (2001) Source
Hossain et al. (2018) Source
Howarth et al. (2010) Source
Howarth et al. (2012) Source
Howell et al. (2000) Source
Hu and Wu (2004) Source
Huang and Neher (1996) Source
Hubel and Wiesel (1959) Source
Huys et al. (2006) Source
ImageNet Source
ImageNet Winning CNN Architectures (ILSVRC) Source
ImageNet, “Summary and Statistics” Source
Irvine (2000) Source
Izhikevich (2003) Source
Izhikevich (2004) Source
Izhikevich and Edelman (2007) Source
Izhikevich et al., “why did I do that?” Source
Jabr (2012a) Source
Jabr (2012b) Source
Jackson et al. (1991) Source
Jadi et al. (2014) Source
Jeffreys (1995) Source
Jenkins et al. (2018) Source
Johansson et al. (2014) Source
Johnson (1999) Source
Jolivet et al. (2006a) Source
Jolivet et al. (2008a) Source
Jolivet et al. (2008b) Source
Jonas (2014) Source
Jonas and Kording (2016) Source
Jones and Gabbiani (2012) Source
Jourdain et al. (2007) Source
Journal of Evolution and Technology, “Peer Commentary on Moravec’s Paper” Source
Juusola et al. (1996) Source
Káradóttir et al. (2008) Source
Kahn and Mann (2020) Source
Kandel et al. (2013a) Source
Kandel et al. (2013b) Source
Kandel et al. (2013c) Source
Kaplan (2018) Source
Kaplan et al. (2020) Source
Kaplanis et al. (2018) Source
Karpathy (2012) Source
Karpathy (2014a) Source
Karpathy (2014b) Source
Kawaguchi and Sakaba (2015) Source
Keat et al. (2001) Source
Kell et al. (2018) Source
Kempes et al. (2017) Source
Kety (1957) Source
Keysers et al. (2001) Source
Khaligh-Razavi and Kiregeskorte (2014) Source
Khan (2020) Source
Khan Academy, “Neurotransmitters and receptors” Source
Khan Academy, “Overview of neuron structure and function” Source
Khan Academy, “Q & A: Neuron depolarization, hyperpolarization, and action potentials” Source
Khan Academy, “The membrane potential” Source
Khan Academy, “The synapse” Source
Kim (2014) Source
Kindel et al. (2019) Source
Kiregeskorte (2015) Source
Kish (2016) Source
Kleinfield et al. (2019) Source
Kleinjung et al. (2010) Source
Klindt et al. (2017) Source
Knudsen et al. (1979) Source
Knuth (1997) Source
Kobayashi et al. (2009) Source
Koch (1999) Source
Koch (2016) Source
Koch et al. (2004) Source
Kole et al. (2007) Source
Kolesnikov et al. (2020) Source
Kostyaev (2016) Source
Kozlov et al. (2006) Source
Kriegeskorte (2015) Source
Krizhevsky et al. (2009) Source
Krizhevsky et al. (2012) Source
Krueger (2008) Source
Kruijer et al. (1984) Source
Kuba et al. (2005) Source
Kuba et al. (2006) Source
Kuga et al. (2011) Source
Kumar (2020) Source
Kurzweil (1999) Source
Kurzweil (2005) Source
Kurzweil (2012) Source
López-Suárez et al. (2016) Source
Lahiri and Ganguli (2013) Source
Lake et al. (2015) Source
Lamb et al. (2019) Source
Landauer (1961) Source
Langille and Brown (2018) Source
Lau and Nathans (1987) Source
Laughlin (2001) Source
Laughlin et al. (1998) Source
Lauritzen (2001) Source
LeCun and Bengio (2007) Source
LeCun et al. (2015) Source
Lee (2011) Source
Lee (2016) Source
Lee et al. (1988) Source
Lee et al. (2010) Source
Lee et al. (2015) Source
Leng and Ludwig (2008) Source
Lennie (2003) Source
Levy and Baxter (1996) Source
Levy and Baxter (2002) Source
Levy et al. (2014) Source
Li et al. (2019) Source
Liao et al. (2015) Source
Lillicrap and Kording (2019) Source
Lillicrap et al. (2016) Source
Lind et al. (2018) Source
Lindsay (2020) Source
Litt et al. (2006) Source
Llinás (2008) Source
Llinás et al. (2004) Source
Lloyd (2000) Source
Lodish et al. (2000) Source
Lodish et al. (2008) Source
London and Häusser (2005) Source
Lucas (1961) Source
Luczak et al. (2015) Source
Lumen Learning, “Action Potential” Source
Lumen Learning, “Resting Membrane Potential” Source
Luscher and Malenka (2012) Source
Machine Intelligence Research Institute, “Erik DeBenedictis on supercomputing” Source
Machine Intelligence Research Institute, “Mike Frank on reversible computing” Source
Macleod, Horiuchi et al. (2007) Source
Maheswaranathan et al. (2019) Source
Mainen and Sejnowski (1995) Source
Mains and Eipper (1999) Source
Major, Larkum, and Schiller (2013) Source
Malickas (2007) Source
Malonek et al. (1997) Source
Marblestone et al. (2013) Source
Marcus (2015) Source
Marder (2012) Source
Marder and Goaillard (2006) Source
Markram et al. (1997) Source
Markram et al. (2015) Source
Maroney (2005) Source
Maroney (2018) Source
Marr (1982) Source
Martin et al. (2006) Source
Martins (2012) Source
Martins et al. (2012) Source
Mathematical Association of America, “Putnam Competition” Source
Mathis et al. (2012) Source
Matsuura et al. (1999) Source
Maturana et al. (1960) Source
McAnany and Alexander (2009) Source
McCandlish et al. (2018) Source
McDermott (2014) Source
McDonnel and Ward (2011) Source
McFadden and Al-Khalili (2018) Source
McLaughlin (2000) Source
McNaughton et al. (2006) Source
Mead (1989) Source
Mead (1990) Source
Medina et al. (2000) Source
Medlock (2017) Source
Mehar (2020) Source
Mehta and Schwab (2012) Source
Mehta et al. (2016) Source
Meister et al. (2013) Source
Merel et al. (2020) Source
Merkle (1989) Source
Mermillod et al. (2013) Source
Metaculus, “What will the necessary computational power to replicate human mental capability turn out to be?” Source
Metric Conversions, “Celsius to Kelvin” Source
Miller (2018) Source
Miller et al. (2014) Source
Min and Nevian (2012) Source
Min et al. (2012) Source
Ming and Song (2011) Source
MIT Open Courseware, “Lecture 1.2: Gabriel Kreiman – Computational Roles of Neural Feedback” Source
Mnih et al. (2015) Source
Moehlis et al. (2006) Source
Monday et al. (2018) Source
Moore and Cao (2008) Source
Moore et al. (2017) Source
Mora-Bermúdez et al. (2016) Source
Mora-Bermúdez (2016) Source
Moravčík et al. (2017) Source
Moravec (1988) Source
Moravec (1998) Source
Moravec (2008) Source
Moreno-Jimenez et al. (2019) Source
Moser and Moser (2007) Source
Movshon et al. (1978a) Source
Mu et al. (2019) Source
Muehlhauser (2017a) Source
Muehlhauser (2017b) Source
Müller and Hoffmann (2017) Source
Müller et al. (1984) Source
Munno and Syed (2003) Source
Nadim and Bucher (2014) Source
Nadim and Manor (2000) Source
Napper and Harvey (1988) Source
Nature Communications, “Building brain-inspired computing” Source
Nature, “Far To Go” Source
Naud and Gerstner (2012a) Source
Naud and Gerstner (2012b) Source
Naud et al. (2009) Source
Naud et al. (2014) Source
Neishabouri and Faisal (2014) Source
Nelson and Nunneley (1998) Source
Nett et al. (2002) Source
Next Big Future, “Henry Markram Calls the IBM Cat Scale Brain Simulation a Hoax” Source
Nicolesis and Circuel (2015) Source
Nielsen (2015) Source
Nimmerjahn et al. (2009) Source
Nirenberg and Pandarinath (2012) Source
Niven et al. (2007) Source
Nordhaus (2001) Source
Norton (2004) Source
Norup Nielsen and Lauritzen (2001) Source
NVIDIA, “Steel for the AI Age: DGX SuperPOD Reaches New Heights with NVIDIA DGX A100” Source
NVIDIA, “NVIDIA Tesla V100 GPU Architecture” Source
NVIDIA, “NVIDIA V100 Tensor Core GPU” Source
Oberheim et al. (2006) Source
Okun et al. (2015) Source
Olah et al. (2018) Source
Olah et al. (2020a) Source
Olah et al. (2020b) Source
Olshausen and Field (2005) Source
OpenAI et al. (2019) Source
OpenAI, “Solving Rubik’s Cube with a Robot Hand” Source
OpenStax, “Anatomy and Physiology” Source
Otsu et al. (2015) Source
Ouldridge (2017) Source
Ouldridge and ten Wolde (2017) Source
Pakkenberg and Gundersen (1997) Source
Pakkenberg et al. (2002) Source
Pakkenberg et al. (2003) Source
Panatier et al. (2011) Source
Papers with Code, “Object Detection on COCO test-dev” Source
Park and Dunlap (1998) Source
Parpura and Zorec (2010) Source
Pascual et al. (2005) Source
Pasupathy and Connor (1999) Source
Pasupathy and Connor (2001) Source
Pavone et al. (2013) Source
Payeur et al. (2019) Source
Peña et al. (1996) Source
Penrose (1994) Source
Penrose and Hameroff (2011) Source
Perea and Araque (2005) Source
Peterson (2009) Source
Piccinini (2017) Source
Piccinini and Scarantino (2011) Source
Pillow et al. (2005) Source
Poirazi and Papoutsi (2020) Source
Poirazi et al. (2003) Source
Poldrack et al. (2017) Source
Polsky, Mel, and Schiller (2004) Source
Porter and McCarthy (1997) Source
Potter et al. (2013) Source
Pozzorini et al. (2015) Source
Prakriya and Mennerick (2000) Source
Principles of Computational Modelling in Neuroscience, “Figure Code examples.all” Source
Prinz et al. (2004) Source
Pulsifer et al. (2004) Source
Purves et al. (2001) Source
Putnam Problems (2018) Source
Qiu et al. (2015) Source
Queensland Brain Institute, “Long-term synaptic plasticity” Source
Radford et al. (2019) Source
Rakic (2008) Source
Rall (1964) Source
Rama et al. (2015a) Source
Rama et al. (2015b) Source
Raphael et al. (2010) Source
Rauch et al. (2003) Source
Ravi (2018) Source
Raymond et al. (1996) Source
Reardon et al. (2018) Source
Recht et al. (2019) Source
Reyes (2001) Source
Reyes et al. (1996) Source
Rieke and Rudd (2009) Source
Rieke et al. (1997) Source
Roe et al. (2020) Source
Rolfe and Brown (1997) Source
Rosenfeld et al. (2018) Source
Roska and Werblin (2003) Source
Rupprecht et al. (2019) Source
Russakovsky et al. (2014) Source
Russo (2017) Source
Sabatini and Regehr (1997) Source
Sadtler et al. (2014) Source
Sagawa (2014) Source
Sakry et al. (2014) Source
Saleem et al. (2017) Source
Sandberg (2013) Source
Sandberg (2016) Source
Sandberg and Bostrom (2008) Source
Santello et al. (2011) Source
Santos-Carvalho et al. (2015) Source
Sarma et al. (2018) Source
Sarpeshkar (1997) Source
Sarpeshkar (1998) Source
Sarpeshkar (2010) Source
Sarpeshkar (2013) Source
Sarpeshkar (2014) Source
Sartori et al. (2014) Source
Sasaki et al. (2012) Source
Scellier and Bengio, 2016 Source
Schecter et al. (2017) Source
Schlaepfer et al. (2006) Source
Schmidt-Hieber et al. (2017) Source
Schneider and Gersting (2018) Source
Schrimpf et al. (2018) Source
Schroeder (2000) Source
Schubert et al. (2011) Source
Schultz (2007) Source
Schulz (2010) Source
Schummers et al. (2008) Source
Schwartz and Javitch (2013) Source
Science Direct, “Membrane Potential” Source
Science Direct, “Pyramidal Cell” Source
ScienceDirect, “Endocannabinoids” Source
Scott et al. (2008) Source
Segev and Rall (1998) Source
Selverston (2008) Source
Semiconductor Industry Association, “2015 International Technology Roadmap for Semiconductors (ITRS)” Source
Serre (2019) Source
Seung (2012) Source
Shadlen and Newsome (1998) Source
Shapley and Enroth-Cugell (1984) Source
Sheffield (2011) Source
Shenoy et al. (2013) Source
Shepherd (1990) Source
Sheth et al. (2004) Source
Shoham et al. (2005) Source
Shouval (2007) Source
Shu et al. (2006) Source
Shu et al. (2007) Source
Shulz and Jacob (2010) Source
Siegelbaum and Koester (2013a) Source
Siegelbaum and Koester (2013b) Source
Siegelbaum and Koester (2013c) Source
Siegelbaum and Koester (2013d) Source
Siegelbaum et al. (2013a) Source
Siegelbaum et al. (2013b) Source
Siegelbaum et al. (2013c) Source
Silver et al. (2016) Source
Sipser (2013) Source
Sjöström and Gerstner (2010) Source
Skora et al. (2017) Source
Slee et al. (2010) Source
Smith et al. (2019) Source
Sokoloff (1960) Source
Sokoloff et al. (1977) Source
Song et al. (2007) Source
Sorrells et al. (2018) Source
Srinivasan et al. (2015) Source
Stack Exchange, “Number of FLOPs (floating point operations) for exponentiation” Source
Stack Overflow, “How many FLOPs does tanh need?” Source
Stanford Encyclopedia of Philosophy, “Embodied Cognition” Source
Stanford Medicine, “Stanford Artificial Retina Project | Competition” Source
Steil (2011) Source
Stevenson and Kording (2011) Source
Stobart et al. (2018a) Source
Stobart et al. (2018b) Source
Stopfer et al. (2003) Source
Storrs et al. (2020) Source
Street (2016) Source
Stringer et al. (2018) Source
Stuart and Spruston (2015) Source
Su et al. (2012) Source
Such et al. (2018) Source
Sun (2017) Source
Swaminathan (2008) Source
Swenson (2006) Source
Szegedy et al. (2013) Source
Szegedy et al. (2014) Source
Szucs and Ioannidis (2017) Source
Takahashi (2012) Source
Tan and Le (2019) Source
Tan et al. (2019) Source
Tan et al. (2020) Source
Tang et al. (2001) Source
Tao and Poo (2001) Source
Taylor et al. (2000) Source
TED, “Robin Hanson: What would happen if we upload our brains to computers?” Source
Tegmark (1999) Source
Tegmark (2017) Source
Thagard (2002) Source
The Physics Factbook, “Energy in ATP” Source
The Physics Factbook, “Power of a Human Brain” Source
The Physics Factbook, “Power of a Human” Source
The Physics Factbook, “Volume of a Human” Source
Theodosis et al. (2008) Source
Thinkmate, “NVIDIA® Tesla™ V100 GPU Computing Accelerator” Source
Thomé (2019) Source
Thomson and Kristan (2006) Source
Thorpe, Fize, and Marlot (1996) Source
Top 500, “June 2020” Source
Top 500, “November 2019” Source
Toutounian and Ataei (2009) Source
Trafton (2014) Source
Trenholm and Awatramani (2019) Source
Trenholm et al. (2013) Source
Trettenbrein (2016) Source
Trussell (1999) Source
Tsien (2013) Source
Tsodyks and Wu (2013) Source
Tsodyks et al. (1999) Source
Tsubo et al. (2012) Source
Tuszynski (2006) Source
Twitter, “David Pfau” Source
Twitter, “Kevin Lacker” Source
Twitter, “Sharif Shameem” Source
Twitter, “Tim Brady” Source
Tzilivaki et al. (2019) Source
Ujfalussy et al. (2018) Source
Urbanczik and Senn (2009) Source
Uttal (2012) Source
Vaccaro and Barnett (2011) Source
Vallbo et al. (1984) Source
van den Oord et al. (2016) Source
van Steveninck et al. (1997) Source
Vanzetta et al. (2004) Source
Varpula (2013) Source
Venance et al. (1997) Source
Verkhratsky and Butt, eds. (2013) Source
Vinyals et al. (2019) Source
VisualChips, “6502 – simulating in real time on an FPGA” Source
VisualChips, “Visual Transistor-level Simulation of the 6502 CPU and other chips!” Source
Volkmann (1986) Source
Volterra and Meldolesi (2005) Source
von Bartheld et al. (2016) Source
von Neumann (1958) Source
Vroman et al. (2013) Source
Vul and Pashler (2017) Source
Waldrop (2012) Source
Walsh (1999) Source
Wang et al. (2006) Source
Wang et al. (2009) Source
Wang et al. (2010) Source
Wang et al. (2014) Source
Wang et al. (2016) Source
Wärnberg and Kumar (2017) Source
Watts et al. (2018) Source
Weiss and Faber (2010) Source
Weiss et al. (2018) Source
White et al. (1984) Source
Wikimedia, “Receptive field.png” Source
Wikipedia, “Action potential” Source
Wikipedia, “Allocortex” Source
Wikipedia, “Angular diameter” Source
Wikipedia, “Astrocyte” Source
Wikipedia, “Boltzmann’s constant” Source
Wikipedia, “Boolean satisfiability problem” Source
Wikipedia, “Brain size” Source
Wikipedia, “Breadth-first search” Source
Wikipedia, “Caenorhabditis elegans” Source
Wikipedia, “Cerebellar agenesis” Source
Wikipedia, “Cerebellar granule cell” Source
Wikipedia, “Cerebral cortex” Source
Wikipedia, “Chemical synapse” Source
Wikipedia, “Conditional entropy” Source
Wikipedia, “Convolutional neural network” Source
Wikipedia, “Decapoda” Source
Wikipedia, “Electrical synapse” Source
Wikipedia, “Electroencephalography” Source
Wikipedia, “Entropy (information theory)” Source
Wikipedia, “Entropy (statistical thermodynamics)” Source
Wikipedia, “Excitatory postsynaptic potential” Source
Wikipedia, “Exponential decay” Source
Wikipedia, “Extended mind thesis” Source
Wikipedia, “Floating-point arithmetic” Source
Wikipedia, “Fugaku (supercomputer)” Source
Wikipedia, “Functional magnetic resonance imaging” Source
Wikipedia, “Gabor filter” Source
Wikipedia, “Gap junction” Source
Wikipedia, “Glia” Source
Wikipedia, “Grid cell” Source
Wikipedia, “Hemispherectomy” Source
Wikipedia, “Hodgkin-Huxley model” Source
Wikipedia, “Human body temperature” Source
Wikipedia, “Injective function” Source
Wikipedia, “Ion” Source
Wikipedia, “Landauer’s principle” Source
Wikipedia, “Membrane” Source
Wikipedia, “Microstates (statistical mechanics)” Source
Wikipedia, “MOS Technology 6502” Source
Wikipedia, “Multiply-accumulate operation” Source
Wikipedia, “Neocortex” Source
Wikipedia, “Neural circuit” Source
Wikipedia, “Neuromorphic engineering” Source
Wikipedia, “Neuropeptide” Source
Wikipedia, “Perineuronal net” Source
Wikipedia, “Pyramidal cell” Source
Wikipedia, “Recurrent neural network” Source
Wikipedia, “RSA numbers” Source
Wikipedia, “Scientific notation” Source
Wikipedia, “Synapse” Source
Wikipedia, “Synaptic weight” Source
Wikipedia, “Thermodynamic temperature” Source
Wikipedia, “Traversed edges per second” Source
Wikipedia, “Visual cortex” Source
Wikipedia, “White matter” Source
Wilson and Foglia (2015) Source
Winship et al. (2007) Source
WolframAlpha Source
Wolpert (2016) Source
Wolpert (2019a) Source
Wolpert (2019b) Source
Wong-Riley (1989) Source
Wu et al. (2016) Source
Yamins and DiCarlo (2016) Source
Yamins et al. (2014) Source
Yang and Calakos (2013) Source
Yang and Wang (2006) Source
Yang et al. (1998) Source
Yap and Greenberg (2018) Source
YouTube, “Analog Supercomputers: From Quantum Atom to Living Body | Rahul Sarpeshkar | TEDxDartmouth” Source
YouTube, “Biophysics of object segmentation in a collision-detecting neuron” Source
YouTube, “Bush dodges flying shoes” Source
YouTube, “Homo digitalis – Henry Markram” Source
YouTube, “Hubel and Wiesel Cat Experiment” Source
YouTube, “Jonathan Pillow – Tutorial: Statistical models for neural data – Part 1 (Cosyne 2018)” Source
YouTube, “Lecture 7: Information Processing in the Brain” Source
YouTube, “Markus Meister, Neural computations in the retina: from photons to behavior: 2016 Sharp Lecture” Source
YouTube, “Matt Botvinick: Neuroscience, Psychology, and AI at DeepMind | Lex Fridman Podcast #106” Source
YouTube, “Neural networks and the brain: from the retina to semantic cognition – Surya Ganguli” Source
YouTube, “Neuralink Launch Event” Source
YouTube, “Quantum Processing in the Brain? (Matthew PA Fisher)” Source
YouTube, “Stanford Seminar – Generalized Reversible Computing and the Unconventional Computing Landscape” Source
YouTube, “The Stilwell Brain” Source
YouTube, “Yann LeCun – How does the brain learn so much so quickly? (CCN 2017)” Source
Yu et al. (2009) Source
Yue et al. (2016) Source
Yuste (2015) Source
Zador (1998) Source
Zador (1999) Source
Zador (2019) Source
Zaghloul and Boahen (2006) Source
Zbili and Debanne (2019) Source
Zbili et al. (2016) Source
Zenke et al. (2017) Source
Zhang et al. (2014) Source
Zhang et al. (2019) Source
Zhou et al. (2013) Source
Zhu et al. (2012) Source
Zilberter et al. (2005) Source
Zuo et al. (2005) Source
Zuo et al. (2015) Source

1.The names “mechanistic method” and “functional method” were suggested by our technical advisor Dr. Dario Amodei, though he uses a somewhat more specific conception of the mechanistic method. Sandberg and Bostrom (2008) also distinguish between “straightforward multiplicate estimates” and those that are based on “analogy or constraints” (p. 84, Appendix A).

2.Here I am using “software” in a way that includes trained models in addition to hand-coded programs. Some forms of hardware (including neuromorphic hardware – see Mead (1989)) complicate traditional distinctions between hardware and software, but the broader consideration at stake here – e.g., that task-performance requires organizing available computational power in the right way – remains applicable.

3.Though it also seems easier, in general, to show that X is enough, than that X is strictly required – an asymmetry present throughout the report.

4.The probabilities reported here should be interpreted as subjective levels of confidence or “credences,” not as claims about objective frequencies, statistics, or “propensities” (see Peterson (2009), Chapter 7, for discussion of various alternative interpretations of probability judgments). One way of defining these credences is via preferences over lotteries – a definition I find useful (though not fully satisfactory). On such a definition, “I think it more likely than not” means that, for example, if I had the option to win $10,000 if 10^15 FLOP/s is sufficient, in principle, to match human-level task-performance, or the option to win $10,000 if 10^15 FLOP/s is not sufficient, I would choose the former option. Skepticism about my answer should go in proportion to confidence that 10^15 FLOP/s is not sufficient (e.g., those who disagree should prefer the latter option to the former), rather than in proportion to dissatisfaction with the evidence available either way (I too am quite dissatisfied in this regard), or disinclination to take real-world bets (why turn down a free chance at $10,000?). That said, for various reasons, I don’t find this definition of subjective probability judgments fully satisfactory (in particular, it transforms probabilistic claims about the world into true/false claims about one’s betting behavior – and it’s not clear exactly what sort of betting behavior is implied, or what consistency in such behavior is assumed), so I offer it more as a gesture at a way of soliciting subjective credences than as an endorsed definition. See Peterson (2009), section 7.5, for discussion of lotteries of this type in the context of the literature on decision theory. See also this blog post by Andrew Critch for more informal discussion; and see Muehlhauser (2017a), section 2, for discussion of some complexities involved in using these probabilities in practice.
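To make the expected-value logic behind this lottery framing explicit, here is a minimal sketch (mine, for illustration only; it assumes a simple expected-dollar maximizer, which the caveats above about betting behavior complicate):

```python
# Toy sketch of the lottery reading of "more likely than not":
# an expected-dollar maximizer with credence p in proposition X,
# offered $10,000 if X versus $10,000 if not-X, bets on X iff p > 0.5.

def preferred_lottery(credence_in_x: float, payoff: float = 10_000) -> str:
    """Which bet a simple expected-value maximizer would choose."""
    ev_on_x = credence_in_x * payoff            # pays off only if X is true
    ev_on_not_x = (1 - credence_in_x) * payoff  # pays off only if X is false
    return "bet on X" if ev_on_x > ev_on_not_x else "bet on not-X"

print(preferred_lottery(0.6))  # bet on X      (credence above 0.5)
print(preferred_lottery(0.3))  # bet on not-X  (credence below 0.5)
```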

5.I focus on this model in particular because I think it fits best with the mechanistic method evidence I’ve thought about most and take most seriously. Offering specific probabilities keyed to the minimum FLOP/s sufficient for task-performance, by contrast, requires answering further questions about the theoretical limits of algorithmic efficiency that I haven’t investigated.

6.See here for V100 prices (currently ~$8,799); and here for the $1 billion Fugaku price tag: “The six-year budget for the system and related technology development totaled about $1 billion, compared with the $600 million price tags for the biggest planned U.S. systems.” Fugaku FLOP/s performance is listed here, at around 4×10^17–5×10^17 FLOP/s. Google’s TPU supercomputer, which recently broke records in training ML systems, can also do ~4×10^17 FLOP/s, though I’m not sure of the costs. See Kumar (2020): “In total, this system delivers over 430 PFLOPs of peak performance.” The A100, for ~$200,000, can do 5×10^15 FLOP/s – see Mehar (2020). NVIDIA’s newest SuperPOD can deliver ~7×10^17 FLOP/s of AI performance – see Paikeday (2020).
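For concreteness, here is the back-of-the-envelope cost-per-FLOP/s arithmetic these figures imply (a rough sketch using only the numbers quoted above; these mix peak, benchmark, and “AI performance” figures at different precisions, so the comparison is loose):

```python
# Rough $ per (FLOP/s) implied by the figures quoted above.
# Caveat: peak vs. benchmark vs. "AI performance" numbers are not
# strictly comparable, so treat this as order-of-magnitude only.

systems = {
    "Fugaku":   (1e9, 4.5e17),  # ~$1 billion, ~4-5 x 10^17 FLOP/s
    "DGX A100": (2e5, 5e15),    # ~$200,000, ~5 x 10^15 FLOP/s
}

for name, (dollars, flop_s) in systems.items():
    print(f"{name}: ~{dollars / flop_s:.1e} $ per FLOP/s")
# Fugaku: ~2.2e-09; DGX A100: ~4.0e-11 -- roughly a 50x difference.
```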

7.See discussion in Section 1.3 below.

8.Selection effects include: expertise related to an issue relevant to the report, willingness to talk with me about the subject, recommendation by one of the other experts I spoke with as a possible source of helpful input, and connection (sometimes a few steps removed) with the professional and social communities that intersect at Open Philanthropy.

9.See Poldrack et al. (2017); Vul and Pashler (2017); Uttal (2012); Button et al. (2013); Szucs and Ioannidis (2017); and Carp (2012). See also Muehlhauser (2017b), Appendix Z.8, for discussion of his reasons for default skepticism of published studies. My thanks to Luke Muehlhauser for suggesting this type of consideration and these references.

10.This effort is itself part of a project at Open Philanthropy currently called Worldview Investigations, which aims to investigate key questions informing our grant-making.

11.See, for example, Moravec (1998), chapter 2; and Kurzweil (2005), chapter 3. See this list from AI Impacts for related forecasts.

12.See, for example, Malcolm (2000); Lanier (2000) (“Belief # 5”); Russell (2019) (p. 78). AI Impacts offers a framework that I find helpful, which uses indifference curves indicating which combinations of hardware and software capability yield the same overall task-performance. This framework (see especially Figure 3) makes clear that the first human-level AI systems could use much more or much less hardware than the amount “equivalent” to the human brain (at least assuming that this amount is not the absolute minimum) – though see Figure 4 for a scenario in which brain-equivalent hardware is a better basis for forecasts.

13.Moravec argues here that “under current circumstances, I think computer power is the pacing factor for AI” (see his second reply to Robin Hanson). Kurzweil (2005) devotes all of Chapter 4 to the question of software.

14.For example: a ResNet-152 uses ~1e10 FLOP to classify an image, but took ~1e19 FLOP (a billion times more) to train, according to Hernandez and Amodei (2018) (see appendix, though see also Hernandez and Brown (2020) for discussion of decreasing training costs for vision models over time).
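Making the arithmetic explicit (just the ratio of the two figures quoted above):

```python
# Ratio of training FLOP to single-image inference FLOP for the
# ResNet-152 figures quoted above.
inference_flop = 1e10  # ~FLOP to classify one image
training_flop = 1e19   # ~FLOP to train the model
print(f"{training_flop / inference_flop:.0e}")  # 1e+09, i.e. a billion
```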

15.Silver et al. (2017): “Over the course of training, 4.9 million games of self-play were generated” (see “Empirical analysis of AlphaGo Zero training”). A bigger version of the model trained on 29 million games. See Kaplan et al. (2020) and Hestness et al. (2017) for more on the scaling properties for training in deep learning.

16.The question of what sorts of task-performance will result from what sorts of training is centrally important in this context, and I am not here assuming any particular answers to it.

17.The fact that training a model requires running it a lot makes this clear. But there are also more complex relationships between e.g. model size and amount of training data. See Kaplan et al. (2020) and Hestness et al. (2017).

18.See e.g. Dongarra et al. (2003): “the performance of a computer is a complicated issue, a function of many interrelated quantities. These quantities include the application, the algorithm, the size of the problem, the high-level language, the implementation, the human level of effort used to optimize the program, the compiler’s ability to optimize, the age of the compiler, the operating system, the architecture of the computer and the hardware characteristics” (p. 805); Moravec (1988): “Any particular formula for estimating power may be grossly misled by an unlucky or diabolic counterexample. For instance, if a computer’s power were defined simply by how many additions per second it could do, an otherwise useless special circuit made of an array of fast adders, and nothing else, costing a few hundred dollars, could outperform a $10-million supercomputer” (p. 169); Nordhaus (2001): “Measuring computer power has bedeviled analysts because computer characteristics are multidimensional and evolve rapidly over time.” (p. 5).

19.An operation, here, is an abstract mapping from inputs to outputs that can be implemented by a computer, and that is treated as basic for the purpose of the analysis in question (see Schneider and Gersting (2018) (p. 96-100)). A FLOP is itself composed out of a series of much simpler logic operations, which are in some contexts a more natural and basic computational unit. See e.g. Sipser (2013), section 9.3, for discussion of analyzing the complexity of algorithms in terms of the number of AND, OR, and NOT gates required to construct a functional circuit. The report’s analysis could in principle be converted into these units instead – or, indeed, into any computational unit capable of simulating a FLOP.
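As a concrete (toy) illustration of how arithmetic decomposes into AND, OR, and NOT gates of the kind Sipser analyzes, here is a 1-bit full adder, the basic building block of the integer adders inside floating-point hardware (a sketch only; real FLOP circuits are far more elaborate):

```python
# Toy illustration: a 1-bit full adder built entirely from AND, OR,
# and NOT, the kind of primitive logic operation a FLOP ultimately
# decomposes into. (XOR is itself built from AND/OR/NOT below.)

def AND(a, b): return a & b
def OR(a, b):  return a | b
def NOT(a):    return 1 - a

def XOR(a, b):
    return OR(AND(a, NOT(b)), AND(NOT(a), b))

def full_adder(a, b, carry_in):
    """Add three bits; return (sum_bit, carry_out)."""
    s = XOR(XOR(a, b), carry_in)
    carry_out = OR(AND(a, b), AND(carry_in, XOR(a, b)))
    return s, carry_out

assert full_adder(1, 1, 0) == (0, 1)  # 1 + 1     = binary 10
assert full_adder(1, 1, 1) == (1, 1)  # 1 + 1 + 1 = binary 11
```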

20.See e.g. Kahn and Mann (2020): “The success of modern AI techniques relies on computation on a scale unimaginable even a few years ago. Training a leading AI algorithm can require a month of computing time and cost $100 million” (p. 3); and Geoffrey Hinton’s comments in Lee (2016): “In deep learning, the algorithms we use now are versions of the algorithms we were developing in the 1980s, the 1990s. People were very optimistic about them, but it turns out they didn’t work too well. Now we know the reason is they didn’t work too well is that we didn’t have powerful enough computers, we didn’t have enough data sets to train them. If we want to approach the level of the human brain, we need much more computation, we need better hardware.” For more discussion of the compute burdens of contemporary AI applications, see e.g. Kaplan et al. (2020), Amodei and Hernandez (2018), and McCandlish et al. (2018). Note that the dominant costs here are from training the relevant systems, not from running them. However, the costs of training depend centrally on the costs of running (along with other factors). This relationship is central to my colleague Ajeya Cotra’s investigation.

21.I say a little bit about communication bandwidth in Section 5. See Sandberg and Bostrom (2008) (p. 84-85), for a literature review of memory estimates. See Open Philanthropy’s non-verbatim notes from a conversation with Dr. Adam Marblestone (“FLOP/s”) for some discussion of other relevant factors.

22.Eugene Izhikevich, for example, reports that in running his brain simulation, he did not have the memory required to store all of the synaptic weights (10,000 terabytes), and so had to regenerate the anatomy of his simulated brain every time step; and Stephen Larson suggested that one of the motivations behind the Blue Brain project’s reliance on a supercomputer was the need to reduce latency between computation units (see Open Philanthropy’s non-verbatim notes from a conversation with Dr. Stephen Larson (p. 5)). See also Fathom Computing’s comment here: “Data movements, not math or logic operations, are the bottleneck in computing” (though this is hardly an unbiased source); Hollemans’ comments here: “The number of computations — whether you count them as MACCs or FLOPS — is only part of the story. Memory bandwidth is the other part, and most of the time is even more important!”; and various citations from AI Impacts, e.g. Angel et al. (2012), and Takahashi (2012).
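A quick sanity check on the Izhikevich figure (a back-of-the-envelope sketch; the ~10^15 synapse count is an assumption I’m importing for illustration, as the rough human-brain figure used elsewhere in this report):

```python
# Back-of-the-envelope: what the quoted 10,000 terabytes implies per
# synaptic weight, assuming (our assumption, for illustration) on the
# order of 1e15 synapses.

storage_bytes = 10_000 * 1e12  # 10,000 terabytes
assumed_synapses = 1e15
print(f"~{storage_bytes / assumed_synapses:.0f} bytes per weight")  # ~10
```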

23.From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Adam Marblestone: “the architecture of a given computer (especially e.g. a standard von Neumann architecture) might create significant overhead. For example, the actual brain co-locates long-term memory and computing. If you had to store longer-term data in a conventional RAM instead, many additional operations might be necessary in order to locate, address, and update relevant variables” (p. 1). One option for reducing overheads might involve neuromorphic computing architectures (see Mead (1989), descriptions here, and papers here; Zaghloul and Boahen (2006) report a “100-fold improvement over conventional microprocessors” (p. 266)). There is also a growing industry of chips designed specifically for AI applications (see Khan (2020): “AI-specialized chip designs are an additional 10 to 1,000 more cost-effective for training AI algorithms than ordinary chips” (p. 2)).

24.An example of “unrealistically extreme abundance” would be the type of abundance of memory required by a giant look-up table. Even bracketing such obviously extreme scenarios, though, it seems possible that trade-offs between FLOP/s and other computational resources might complicate talk about the minimum FLOP/s sufficient to do X, absent further more specific constraints on the other resources available. I haven’t delved into this issue much: my hope is that insofar as it’s a problem in theory, the actual evidence surveyed in the report will still be useful in practice.

25.See Ananthanarayanan et al. (2009) for discussion of the hardware complexities involved in brain simulation.

26.Objections focused on general differences between brains and various human-engineered computers (e.g., the brain lacks a standardized clock, the brain is very parallel, the brain is analog, the brain is stochastic, the brain is chaotic, the brain is embodied, the brain’s memory works differently, the brain lacks a sharp distinction between hardware and software, etc.) are therefore relevant only insofar as they are incompatible with particular claims in the report; they are not, as far as I can tell, incompatible with any underlying assumptions of the project as a whole (except insofar as they are taken to suggest that no human-engineered computer can perform the tasks the brain performs – a form of skepticism the report does not attempt to address). See Marcus (2015) for discussion of some such objections. The different methods I consider rely on their own, somewhat more substantive assumptions.

27.My impression is that the content reviewed here is basically settled science, though see Section 1.5.1 for discussion of various types of ongoing neuroscientific uncertainty.

28.Azevedo et al. (2009): “We find that the adult male human brain contains on average 86.1 ± 8.1 billion NeuN-positive cells (“neurons”) and 84.6 ± 9.8 billion NeuN-negative (“nonneuronal”) cells” (p. 532). My understanding is that the best available method of counting neurons is isotropic fractionation, which proceeds by dissolving brain structures into a kind of homogenous “brain soup,” and then counting cell nuclei (see Herculano-Houzel and Lent (2005) for a more technical description of the process, and Bartheld et al. (2016) for a history of cell-counting in the brain). Note that there may be substantial variation in cell counts between individuals (for example, according to Bartheld et al. (2016) (p. 9), citing Haug (1986) and Pakkenberg and Gundersen (1997), neocortical neuron count may vary by a factor of more than two, though I haven’t checked these further citations). At one point it was widely thought that the ratio of glial cells (a type of non-neuronal cell) to neurons in the brain was 10:1, but this is wrong (see Bartheld et al. (2016)).

29.I do not have a rigorous definition of “signaling” between cells, though there may be one available. A central example would be when one cell has a specialized mechanism for sending out a particular type of chemical to another cell, which in turn has a specialized receptor for receiving that chemical. See Lodish et al. (2008), ch. 15 and 16, for lengthy discussion of biological signaling mechanisms. For examples of signaling by non-neuronal cells, see the section on glia. Jess Riedel suggested a definition on which the functionally-structured impact of one cell on another counts as signaling if the impact on the second cell varies based on the state of the first (as opposed to, e.g., one cell sending the other one resources irrespective of the first cell’s state) – a case in which the impact on the second cell provides information about the state of the first (see Open Philanthropy’s non-verbatim notes from a conversation with Dr. Jess Riedel, p. 5).

30.The texts I have engaged with in cognitive science and neuroscience do not attempt to give necessary and sufficient conditions for a physical system to count as “processing information,” and I will not attempt a rigorous definition here (see Piccinini and Scarantino (2011) for an attempt to disambiguate and evaluate a few possible interpretations, based on different possible conceptions of the relevant type of “information”). My impression, though, is that the intuitive notion is roughly as follows. The brain’s activity makes what you do sensitive to sensory input, past and present (someone throws a shoe at your head, and you duck; you see an old friend at a coffee shop, and you stop to chat). Such sensitivity requires that when the brain receives one set of sensory inputs, rather than another, this difference is reflected somehow in the state of the nervous system in a manner available, at least initially, to make a reliable difference between one macroscopically-specified behavioral response or another (though lots of information is quickly discarded). In this sense, the brain takes in or “encodes” information about sensory inputs using different biophysical variables (that is, aspects of the biophysical system that can be in different states). The brain then processes this information in the sense that the states of these variables serve as inputs to further causal processes in the brain which combine to create behavioral sensitivity to high-level properties of an organism’s environment and history. Thus, for example, if you want to set up a brain that causes an organism to run from a tiger, but not from a tree, you need to have more than a set of biophysical variables that correlate with the type of light hitting different parts of the eye – you also need causal processes that “extract” from that light an answer to the question “is this a tiger or a tree?”, and then cause the relevant behavioral response. For more discussion in this vein, see e.g. London and Häusser (2005) (p. 209); Koch (1999) (p. 1); Hanson (2016) (p. 50); and Marr (1982) (p. 3). See this video for a vivid illustration of feature extraction; and this video for a nice example of neural information-processing.
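To make the “encode, extract, act” picture concrete, here is a deliberately toy sketch (all features, numbers, and the threshold are invented for illustration; real feature extraction involves nothing this simple):

```python
# Toy sketch of the "encode, then extract, then act" picture described
# above. Everything here (features, threshold, labels) is invented for
# illustration.

def encode(stimulus: dict) -> list:
    """Map raw 'sensory input' to biophysical-variable-like numbers."""
    return [stimulus["orange"], stimulus["striped"], stimulus["moving"]]

def extract_is_tiger(features: list) -> bool:
    """Combine low-level features into a high-level property."""
    return sum(features) / len(features) > 0.5

def behave(stimulus: dict) -> str:
    return "run" if extract_is_tiger(encode(stimulus)) else "stay"

print(behave({"orange": 0.9, "striped": 0.8, "moving": 0.7}))  # run
print(behave({"orange": 0.1, "striped": 0.0, "moving": 0.2}))  # stay
```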

31.See the “anatomy of a neuron” section here for quick description. See Kandel et al. (2013), ch. 4-8, Lodish et al. (2008), ch. 23, and this series of videos, for detailed descriptions of basic neuron structure and function.

32.Neurons can also synapse onto blood vessels, muscle cells, neuron cell bodies, axons, and axon terminals (at least according to the medical gallery of Blausen Medical 2014), but for simplicity, I will focus on synapses between axon terminals and dendrites in what follows.

33.See Siegelbaum and Koester (2013a): “In addition to ion channels, nerve cells contain a second important class of proteins specialized for moving ions across cell membranes, the ion transporters or pumps. These proteins do not participate in rapid neuronal signaling but rather are important for establishing and maintaining the concentration gradients of physiologically important ions between the inside and outside of the cell” (p. 100). See also the section on “Where does the resting membrane potential come from?” here.

34.See Siegelbaum and Koester (2013c) (p. 126-147); and the section “Where does the resting membrane potential come from?” here.

35.See Siegelbaum and Koester (2013a) (p. 100-124), for detailed description of ion channel dynamics.

36.See Kandel et al. (2013) (p. 31-35); and Siegelbaum and Koester (2013b) (p. 148-171), for description. See also here.

37.See Siegelbaum and Koester (2013d) (p. 184-187); Siegelbaum et al. (2013c) (p. 260-287); and description here in the section “overview of transmission at chemical synapses”). See also Lodish et al. (2008) (p. 1020). Note that action potentials do not always trigger synaptic transmission: see section 2.1.1.2.2.

38.I’ll refer to the event of a spike arriving at a synapse as a “spike through synapse.” A network of interacting neurons is sometimes called a neural circuit. A series of spikes from a single neuron is sometimes called a spike train. From Khan Academy: “we can divide the receptor proteins that are activated by neurotransmitters into two broad classes: Ligand-activated ion channels: These receptors are membrane-spanning ion channel proteins that open directly in response to ligand binding. Metabotropic receptors: These receptors are not themselves ion channels. Neurotransmitter binding triggers a signaling pathway, which may indirectly open or close channels (or have some other effect entirely)” (see section “Two types of neurotransmitter receptors”). See Siegelbaum et al. (2013) (p. 210-235), for more on the first class of receptors; and Siegelbaum et al. (2013b) (p. 236-255), for more on the second.

39.This particular picture appears to show one neuron synapsing onto the cell body of another, as opposed to the dendrites. But dendrites are generally taken to be the main receivers of synaptic signals.

40.See Open Philanthropy’s non-verbatim notes from a conversation with Prof. Shaul Druckmann: “Setting aside plasticity, most people assume that modeling the immediate impact of a pre-synaptic spike on the post-synaptic neuron is fairly simple. Specifically, you can use a single synaptic weight, which reflects the size of the impact of a spike through that synapse on the post-synaptic membrane potential” (p. 1). Lahiri and Ganguli (2013) note that the theoretical models often treat synapses as “described solely by a single scalar value denoting the size of a post-synaptic potential” (p. 1), though they do not endorse this.
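A minimal sketch of this single-scalar-weight picture (illustrative numbers only; as the report discusses, real synaptic integration may be much more complex):

```python
# Minimal sketch of the single-scalar-weight model described above:
# each pre-synaptic spike just bumps the post-synaptic membrane
# potential by that synapse's weight. Numbers are illustrative only.

weights = {"syn_A": 0.5, "syn_B": -0.3, "syn_C": 0.8}  # mV per spike
membrane_potential = -70.0                             # resting, mV

for spiking_synapse in ["syn_A", "syn_C", "syn_A"]:    # a spike train
    membrane_potential += weights[spiking_synapse]

print(f"{membrane_potential:.1f} mV")  # -68.2 mV
```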

41.See discussion and citations in Section 2.2 for more details.

42.Cudmore and Desai (2008): “Intrinsic plasticity is the persistent modification of a neuron’s intrinsic electrical properties by neuronal or synaptic activity. It is mediated by changes in the expression level or biophysical properties of ion channels in the membrane, and can affect such diverse processes as synaptic integration, subthreshold signal propagation, spike generation, spike backpropagation, and meta-plasticity” (opening section).

43.See e.g. Munno and Syed (2003), Ming and Song (2011), Grutzendler et al. (2002), Holtmaat et al. (2005).

44.See Schwartz and Javitch (2013), (p. 297-301); Russo (2017); and Leng and Ludwig (2008): “Neurones use many different molecules to communicate with each other, acting in many different ways via specific receptors. Amongst these molecules are more than a hundred different peptides, expressed in different subpopulations of neurons, and many of these peptides are known for the distinctive effects on specific physiological functions that follow central administration of peptide agonists or antagonists.” (p. 5625). See also Mains and Eipper (1999).

45.Burrows (1996): “A neuromodulator is a messenger released from a neuron in the central nervous system, or in the periphery, that affects groups of neurons, or effector cells that have the appropriate receptors. It may not be released at synaptic sites, often acts through second messengers and can produce long-lasting effects. The release may be local so that only nearby neurons or effectors are influenced, or may be more widespread, which means that the distinction with a neurohormone can become very blurred. The act of neuromodulation, unlike that of neurotransmission, does not necessarily carry excitation or inhibition from one neuron to another, but instead alters either the cellular or synaptic properties of certain neurons so that neurotransmission between them is changed” (p. 195).

46.Araque and Navarrete (2010) (p. 2375); Bullock et al. (2005), (p. 792); Mu et al. (2019); and the rest of the discussion in Section 2.3.2.

47.See e.g. Anastassiou et al. (2011) and Chang (2019), along with the other citations in Section 2.3.4.

48.See Bullock et al. (2005), describing the history of early neuroscience: “physiological studies established that conduction of electrical activity along the neuronal axon involved brief, all-or-nothing, propagated changes in membrane potential called action potentials. It was thus often assumed that neuronal activity was correspondingly all-or-nothing and that action potentials spread over all parts of a neuron. The neuron was regarded as a single functional unit: It either was active and “firing” or was not” (p. 791).

49.See Zbili and Debanne (2019) for a review, together with the other citations in Section 2.3.5.

50.See Moore and Cao (2008): “we propose that hemodynamics also play a role in information processing through modulation of neural activity… We predict that hemodynamics alter the gain of local cortical circuits, modulating the detection and discrimination of sensory stimuli. This novel view of information processing—that includes hemodynamics as an active and significant participant— has implications for understanding neural representation and the construction of accurate brain models” (p. 2035).

51.A few others I am not discussing include: quantum dynamics (see endnote in section 1.6), the perineuronal net (see Tsien (2013) for discussion), and classical dynamics in microtubules (see Cantero et al. (2018)). I am leaving quantum dynamics aside mostly for the reasons listed in the endnote in section 1.6. I leave out the other two mechanisms partly because of time constraints, and partly because my impression is that they do not feature very prominently in the discourse on this topic. I bucket all the possible alternative mechanisms I am not discussing under the uncertainties discussed in Section 2.3.7.

52.A few representative summaries: Marcus (2015): “Neuroscience today is a collection of facts, rather than ideas; what is missing is connective tissue. We know (or think we know) roughly what neurons do, and that they communicate with one another, but not what they are communicating. We know the identities of many of the molecules inside individual neurons and what they do. We know from neuroanatomy that there are many repeated structures (motifs) throughout the neocortex. Yet we know almost nothing about what those motifs are for, or how they work together to support complex real-world behavior. The truth is that we are still at a loss to explain how the brain does all but the most elementary things. We simply do not understand how the pieces fit together” (p. 205). Einevoll et al. (2015): “Despite decades of intense research efforts investigating the brain at the molecular, cell, circuit and system levels, the operating principles of the human brain, or any brain, remain largely unknown… At present we do not have any well-grounded, and certainly not generally accepted, theory about how networks of millions or billions of neurons work together to provide the salient brain functions in animals or humans. We do not even have a well-established model for how neurons in primary visual cortex of mammals work together to form the intriguing neuronal representations with, for example, orientation selectivity and direction selectivity that were discovered by Hubel and Wiesel sixty years ago (Hubel and Wiesel (1959)).” (p. 2, and p. 8).

53.See especially Open Philanthropy’s non-verbatim notes from conversations with Prof. Eric Jonas; Prof. Shaul Druckmann; Prof. Erik De Schutter; Prof. Konrad Kording; Prof. Eve Marder; Dr. Adam Marblestone; and Dr. Stephen Larson.

54.See Kleinfield et al. (2019) (p. 1005) for a description of various techniques and their limitations. See also Marblestone et al. (2013): “Simultaneously measuring the activities of all neurons in a mammalian brain at millisecond resolution is a challenge beyond the limits of existing techniques in neuroscience… Based on this analysis, all existing approaches require orders of magnitude improvement in key parameters” (p. 1); and Adam (2019): “A technology that simultaneously records membrane potential from multiple neurons in behaving animals will have a transformative effect on neuroscience research” (p. 413), a quote which suggests that, at the least, such a technology is at the cutting edge of what’s available (the paper appears to describe progress on this front). Stevenson and Kording (2011) found that “the number of simultaneously recorded single neurons has been growing rapidly, doubling approximately every 7 years. The trend described here predicts that in 15 years physiologists should be able to record from approximately 1,000 neurons” (p. 141). Their data shows that as of 2010, the maximum was a few hundred, though I’m not sure where it is now (see p. 140).
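The projection is simple exponential growth; making it explicit (the 250-neuron starting point is an illustrative stand-in for “a few hundred”):

```python
# The Stevenson and Kording (2011) projection quoted above: with
# simultaneously recorded neurons doubling roughly every 7 years from
# a few hundred in 2010, ~15 years gets you to roughly 1,000.

neurons_2010 = 250  # "a few hundred" (illustrative midpoint)
doubling_years = 7
for years_out in (7, 15):
    projected = neurons_2010 * 2 ** (years_out / doubling_years)
    print(f"+{years_out} yrs: ~{projected:.0f} neurons")
# +15 yrs: ~1104 neurons, consistent with their ~1,000 figure.
```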

55.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Erik De Schutter: “At this point, we have no way to reliably measure the input-output transformation of a neuron, where the input is defined as a specific spatio-temporal pattern of synaptic input. You can build models and test their input-output mappings, but you don’t really know how accurate these models are… In live imaging, it’s very difficult to see what’s happening at synapses. Some people do calcium imaging of pre-synaptic terminals, but this is only for one part of the overall synaptic input (and it may create artefacts). Currently, you cannot get a global picture of all the synaptic inputs to a single neuron. You can’t stain all the inputs, and for a big neuron you wouldn’t be able to image the whole relevant volume of space… you don’t actually know what the physiological pattern of inputs is.” See also Ujfalussy et al. (2018): “Our understanding of neuronal input integration remains limited because it is either based on data from in vitro experiments, studying neurons under highly simplified input conditions, or on in vivo approaches in which synaptic inputs were not observed or controlled, and thus a systematic characterization of the input-output transformation of neurons was not possible”; and Open Philanthropy’s non-verbatim notes from a conversation with Prof. Shaul Druckmann: “It is very difficult to tell what spatio-temporal patterns of inputs are actually arriving at a neuron’s synapses in vivo. You can use imaging techniques, but this is very messy” (p. 2).

56.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Erik De Schutter: “Using glutamate uncaging, you can reliably activate single dendritic spines in vitro, and you can even do this in a sequence of spines, thereby generating patterns of synaptic input. However, even these patterns are limited. For example, you can’t actually activate synapses simultaneously, because your laser beam needs to move; there’s only so much you can do in a certain timeframe; and because it’s glutamate, you can only activate excitatory neurons” (p. 2). From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Shaul Druckmann: “it is very difficult to tell how a neuron responds to arbitrary patterns of synaptic input. You can stimulate a pre-synaptic neuron and observe the response, but you can’t stimulate all pre-synaptic neurons in different combinations. And you can only patch-clamp one dendrite while also patch-clamping the soma (and this already requires world-class skill)” (p. 2).

57.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Shaul Druckmann: “Technology for measuring the properties relevant to detailed biophysical modeling has improved very little in the past 20 years … Neurons can have a few dozen of some 200-300 types of ion channels, which are strongly non-linear, with large effects, and which are spread out across the neuron. These cannot be modeled based on recordings of neuron spiking activity alone, and staining neurons for these ion channels is very difficult” (p. 2). And from Open Philanthropy’s non-verbatim notes from a conversation with Prof. Konrad Kording: “current techniques are very bad at measuring ion channel plasticity. Neuroscientists don’t tend to focus on it for this reason” (p. 5).

58.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Eric Jonas: “a lot of our animal models are wrong in clinically-relevant ways” (p. 5). And from Open Philanthropy’s non-verbatim notes from a conversation with Prof. E.J. Chichilnisky: “There is variability in retinal function both across species and between individuals of the same species. Mouse retinas are very different from human retinas (a difference that is often ignored), and there is variability amongst monkey retinas as well” (p. 3).

59.For example, spike-timing dependent plasticity – a form of synaptic plasticity – can be reliably elicited in vitro (see Open Philanthropy’s non-verbatim notes from a conversation with Prof. Eric Jonas (p. 3)), but Schulz argues that “Direct evidence for STDP in vivo is limited and suffers from the fact that the used protocols significantly deviate, more often than not, from the traditional pairing of single pre- and postsynaptic spikes (Shulz and Jacob (2010)). Thus, many studies use long-lasting large-amplitude postsynaptic potentials (PSP), and pairing usually involves multiple postsynaptic spikes or high repetition rates. Our own experience from cortico-striatal synaptic plasticity experiments indicates that classic STDP may be less effective in vivo than commonly expected (Schulz et al., 2010)” (p. 1).

60.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Shaul Druckmann: “The tasks that neuroscientists tend to study in model animals are very simple. Many, for example, are some variant on a two-alternative forced choice task (e.g., teaching an animal to act differently, depending on which of two stimuli it receives). This task is extremely easy to model, both with a small number of highly simplified neurons, and with models that do not look like neurons at all. In this sense, tasks like these provide very little evidence about what level of modeling detail is necessary for reproducing more interesting behavior.” (p. 2). And from Open Philanthropy’s non-verbatim notes from a conversation with Prof. Eric Jonas: “In an experiment with a model animal like a rat, which has a very complicated brain, the number of input/output bits we can control/observe is extremely small. This makes it very hard to do informative, high-throughput experiments. Even if you had a billion rats doing your experiment 24/7, you’d still only have a small number of bits going in and out” (p. 2).

61.From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Adam Marblestone: “Neuroscience is extremely limited by available tools. For example, we have the concept of a post-synaptic potential because we can patch-clamp the post-synaptic neuron and see a change in voltage. When we become able to see every individual dendritic spine, we might see that each has a different response; or when we become able to see molecules, we might see faster state transitions, more interesting spatial organization, or more complicated logic at the synapses. We don’t really know, because we haven’t been able to measure” (p. 9).

62.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Konrad Kording: “current techniques are very bad at measuring ion channel plasticity. Neuroscientists don’t tend to focus on it for this reason” (p. 5). From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Shaul Druckmann: “The history of neuroscience sometimes seems like a process in which even though some process or level of detail is important, if it is very difficult to understand it, the community often shifts away from that level, and moves on to another level.. … he thinks that people don’t do detailed modeling because these models are ill-constrained at the current level of data that can be collected and it would require major investment to get the relevant data.” (p. 7).

63.Jonas and Kording (2017): “There is a popular belief in neuroscience that we are primarily data limited…here we take a classical microprocessor as a model organism, and use our ability to perform arbitrary experiments on it to see if popular data analysis methods from neuroscience can elucidate the way it processes information. Microprocessors are among those artificial information processing systems that are both complex and that we understand at all levels, from the overall logical flow, via logical gates, to the dynamics of transistors. We show that the approaches reveal interesting structure in the data but do not meaningfully describe the hierarchy of information processing in the microprocessor. This suggests current analytic approaches in neuroscience may fall short of producing meaningful understanding of neural systems, regardless of the amount of data” (p. 1). Though see also Merel et al. (2020) (p. 2), who use a virtual rodent as a model system, and who take a more optimistic view.

64.See e.g. Lillicrap and Kording (2019): “…We can have a complete description of the network and its computations. And yet, neither we, nor anyone we know feels that they grasp how processing in these networks truly works. Said another way, besides gesturing to a network’s weights and elementary operations, we cannot say how it classifies an image as a cat or a dog, or how it chooses one Go move over another” (p. 1). That said, research on this topic is just getting underway, and some participants are optimistic. See e.g. Olah et al. (2020a): “thousands of hours of studying individual neurons have led us to believe the typical case is that neurons (or in some cases, other directions in the vector space of neuron activations) are understandable… our experience is that there’s usually a simple explanation behind these neurons, and that they’re actually doing something quite natural” (see “Claim 1: Features” and “Claim 2: Circuits”). Some of this work focuses on the type of feature detection that neuroscience already has some preliminary handle on, but efforts to explore the interpretability of other types of models are underway as well (see Greydanus (2017), Such et al. (2018), Rupprecht et al. (2019), here and OpenAI et al. (2019) (p. 30-35), for examples). Personally, I would not be at all surprised if this work ends up quite neuroscientifically informative.

65.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Eve Marder: “It’s been hard to make progress in understanding neural circuits, because in order to know what details matter, you have to know what the circuit is doing, and in most parts of the brain, we don’t know this…It’s not that you can’t make simplifying assumptions. It’s that absent knowledge of what a piece of nervous system needs to be able to do, you have no way of assessing whether you’ve lost something fundamental or not” (p. 4); from Open Philanthropy’s non-verbatim notes from a conversation with Prof. Eric Jonas: “One level of uncertainty comes from the difficulty of defining the high-level task that neural systems are trying to perform (e.g., the “computational level” in the hierarchy proposed by David Marr). Our attempts to capture cognitive tasks with objective functions we can fit machine learning models to are all extreme simplifications. For example, Prof. Jonas is fairly confident that the visual system is not classifying objects into one of k categories” (p. 1); and the notes from Open Philanthropy’s non-verbatim notes from a conversation with Prof. E.J. Chichilnisky: “It’s hard to know when to stop fine-tuning the details of your model. A given model may be inaccurate to some extent, but we don’t know whether a given inaccuracy matters, or whether a human wouldn’t be able to tell the difference (though focusing on creating usable retinal prostheses can help with this)” (p. 3).

66.Dr. Stephen Larson suggested that one benefit of successfully simulating a simple nervous system would be that you could then bound the complexity necessary for such a simulation, and proceed with attempting to simplify it in a principled way (see Open Philanthropy’s non-verbatim notes from a conversation with Dr. Stephen Larson, p. 2). Prof. Shaul Druckmann (see here, p. 6) and Prof. Erik De Schutter appeared sympathetic to a similar research program. From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Erik De Schutter:”The best way forward is to try to explore and understand the function of the brain’s underlying mechanisms – a project that may eventually lead to an understanding of what can be simplified. But to try to simplify things too early, before you understand them, is a dangerous game” (p. 1). Exactly what level of modeling success has been achieved by brain simulations as yet is a complicated issue, but many appear to lack any capacity for complex task-performance (Eliasmith et al. (2012) is one exception; see Open Philanthropy’s non-verbatim notes from a conversation with Prof. Chris Eliasmith for some discussion). Example brain simulations include: Arkhipov et al. (2018), Bileh et al. (2020), Markram et al. (2015); Izhikevich and Edelman (2007); Ananthanarayanan et al. (2009), Howell et al. (2000), Medina et al. (2000), McLaughlin (2000). See Garis et al. (2010) and Sandberg and Bostrom (2008) for surveys.

67.See White et al. (1984). See Jabr (2012b) for some history, as well as Seung (2012): “Mapping the C. elegans nervous system took over a dozen years, though it contains only 7,000 connections” (“Introduction”).

68.From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Stephen Larson, who works on the OpenWorm project: “Despite its small size, we do not yet have a model that captures even 50% of the biological behavior of the C. elegans nervous system. This is partly because we’re just getting to the point of being able to measure what the worm’s nervous system is doing well enough. It is possible to replicate certain kinds of worm behaviors, such as a crawling forward motion, using a very simple neural network. However, the same model cannot be used to make the worm shift into crawling backwards. Rather, you have to re-train it, and even then, you don’t know if the model makes the decision to crawl backward with the same frequency, and for the same reasons, that the real worm does. In general, evolution has equipped the worm to respond to a very wide range of conditions, and the worm’s biology has all of these intricate and complex mechanisms that could potentially be involved in the behaviors you care about” (p. 1). David Dalrymple, who used to work on emulating C. elegans, writes: “Contrary to popular belief, connectomes are not the biological equivalent of circuit schematics. Connectomes are the biological equivalent of what you’d get if you removed all the component symbols from a circuit schematic and left only the wires… What you actually need is to functionally characterize the system’s dynamics by performing thousands of perturbations to individual neurons and recording the results on the network, in a fast feedback loop with a very very good statistical modeling framework which decides what perturbation to try next.” Sarma et al. (2018), in an overview of OpenWorm’s progress, write: “The level of detail that we have incorporated to date is inadequate for biological research. A key remaining component is to complete the curation and parameter extraction of Hodgkin–Huxley models for ion channels to produce realistic dynamics in neurons and muscles” (Section 3). Merel et al. (2020) create a “virtual rodent,” but this is not a bottom up emulation of a rodent brain.

69.Example approaches in this vein include Prof. Markus Meister, see Open Philanthropy’s non-verbatim notes from a conversation with Prof. Markus Meister: “It is theoretically possible that the brain’s task-performance draws on complex chemical computations, implemented by protein circuits, that would require models much more complicated than those that have been successful in the retina. But Prof. Meister’s approach is to ask: is there any evidence that forces us to think in this more complicated way? That is, he starts with the simplest possible explanation of the phenomena, and then adds to this explanation when necessary. Some neuroscientists take a different approach. That is, they ask “what is the most complicated way that this thing could work?”, and then assume that nature is doing that” (p. 4); and from Open Philanthropy’s non-verbatim notes from a conversation with Prof. Chris Eliasmith: “Prof. Eliasmith’s general approach is to see what simple models are able to do, and to introduce additional complexity only when doing so becomes necessary. In his models, he has thus far been able to successfully replicate various types of high-level behavior, along with various types of neuro-physiological data, without recourse to highly complex neuron models – a result that he thinks substantially less likely in worlds where the brain’s performance on these tasks proceeds via biophysical mechanisms his models do not include. However, this doesn’t mean that we won’t discover contexts in which greater complexity is necessary. And we are very far away from being able to test what is required to capture high-level behavior on the scale of the full human brain” (p. 2).

70.From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Stephen Larson: “the jury is still out on how much simplification is available, and Dr. Larson thinks that in this kind of uncertain context, you should focus on the worst-case, most conservative compute estimates as your default. This means worrying about all of the information-processing present in cell biology. In general, in studying complex biological mechanisms, Dr. Larson thinks that the burden of proof is on those who want to say that a given type of simplification is possible” (p. 2). From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Erik De Schutter: “Many common simplifications do not have solid scientific foundations, and are more at the level of “the way we do things.” The best way forward is to try to explore and understand the function of the brain’s underlying mechanisms – a project that may eventually lead to an understanding of what can be simplified. But to try to simplify things too early, before you understand them, is a dangerous game … The brain was not engineered. Rather, it evolved, and evolution works by adding complexity, rather than by simplification. There are good reasons for this complexity. In order to evolve, you can’t have systems, at any level (proteins, channels, cells, brain regions), with unique functions. If you did, and a single mutation knocked out the function, the whole system would crash… Indeed, in general, many scientists who approach the brain from an engineering perspective end up on the wrong footing. Engineering is an appropriate paradigm for building AI systems, but if you want to understand the brain, you need to embrace the fact that it works because it is so complicated. Otherwise, it will be impossible to understand the system” (p. 1).

71.I will not attempt a definition of which tasks count as “cognitive,” but the category should be construed as excluding tasks that are intuitively particular to the brain’s biological substrate – for example, the task of implementing an input-output transformation that will serve as an effective means of predicting how the biological brain will respond to a certain kind of drug, or the task of serving as a good three-pound weight. LeCun and Bengio (2007) gesture at a somewhat similar subset of tasks, which they call the “AI-set”: “Among the set of all possible functions, we are particularly interested in a subset that contains all the tasks involved in intelligent behavior. Examples of such tasks include visual perception, auditory perception, planning, control, etc. The set does not just include specific visual perception tasks (e.g human face detection), but the set of all the tasks that an intelligent agent should be able to learn. In the following, we will call this set of functions the AI-set. Because we want to achieve AI, we prioritize those tasks that are in the AI-set” (p. 4-5). I am also excluding microscopically specified input-output relationships that an actual brain, operating in the type of noisy environments brains evolved in, cannot implement reliably.

72.See Grace et al. (2018) for discussion of a simple version of this task, which involves writing “concise, efficient, and human-readable Python code to implement simple algorithms like quicksort” (p. 19). The median estimate by the experts she surveyed for when AI systems will be able to perform this task was 8.2 years from the time of the survey. GPT-3, a language model released by OpenAI in 2020, is capable of at least some forms of coding (see here for an especially vivid demonstration, here for another example, and here for more discussion).

73.Depending on one’s opinions of the peer review process, perhaps it is debatable whether GPT-3 can do this as well. See here for examples. I chose both the “complex software problem” task and the “review a nature paper” task before the GPT-3 results came out, and they were selected to be tasks that we couldn’t yet do with AI systems.

74.See Grace et al. (2018) (p. 16), for discussion of a version of this task. The median estimate by the experts she surveyed for when AI systems will be able to perform this task was 33.8 years from the time of the survey.

75.It has been occasionally hypothesized that some form of quantum-level information processing is occurring in the brain (see, for example, Hu and Wu (2004), Penrose and Hameroff (2011), and Fisher (2015) for suggestions in this vein, and see Tegmark (1999) and Litt et al. (2006) for counterarguments). My understanding, though, is that the large majority of experts believe that the brain’s information-processing is purely classical. For example, Sandberg and Bostrom (2008) write that: “Practically all neuroscientists subscribe to the dogma that neural activity is a phenomenon that occurs on a classical scale” (p. 37). My impression is that the most influential arguments against quantum computation have been in the vein of Tegmark (1999), who argues that the timescales of quantum decoherence in the brain (~10^-13 to 10^-20 seconds) are too short to play a role in various possible methods of neural information processing, which proceed on much longer timescales (~10^-3 to 10^-1 seconds) (p. 1). That said, there is at least some evidence that non-trivial quantum dynamics play a role in some biological contexts (e.g., photosynthesis, enzyme catalysis, and avian navigation) where arguments that appeal solely to the fact that a biological system is warm/wet/noisy might have ruled them out (my thanks to Prof. David Wallace for suggesting I address this): see, e.g., McFadden and Al-Khalili (2018) for a review. Indeed, Fisher (2015) presents his hypothesis about quantum dynamics in the brain as immune to timescale-based objections. However, my impression at a glance is that his research at this stage is mostly at the level of establishing the theoretical possibility of some form of quantum computation in the brain, as opposed to verifying that such computation is actually occurring. Thus, for example, in this 2019 talk (36:40), he comments: “What I’ve offered is a story at this stage, if you want it’s a partly formed picture puzzle, and what’s needed are experiments to discern the precise shapes of the various pieces in this puzzle, and to see whether they actually exist as pieces, what shapes they are, and whether they start fitting together.” In general, the possibility of quantum computation in the brain is a further category of uncertainty; but it’s an additional can of worms, and because the hypothesis appears to play a comparatively small role in mainstream neuroscience, I’m not going to address it in depth.

76.See Nicolelis and Cicurel (2015), Lucas (1961), Dreyfus (1972) and Penrose (1994) for various forms of skepticism.

77.Note that F does not need to be enough to match the task-performance of a “superbrain” trained and ready to perform any task that any human can perform: e.g., a brain that represents peak human performance on every task simultaneously. Einstein may do physics that requires x FLOP/s, and Toni Morrison may write novels that require y FLOP/s, but F only needs to be greater than or equal to both x and y: it doesn’t need to be greater than or equal to x+y.

78.Herculano-Houzel (2009) reports variation in neuron number within a species at around 10-50%. Reardon et al. (2018) write: “Brain size among normal humans varies as much as twofold.” Koch (2016) cites numbers ranging from 1,017 grams to 2,021 grams (though these are for post-mortem measures), and from 975 cm³ to 1499 cm³.

79.From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Paul Christiano: “If you include a sufficiently broad range of tasks that the human brain can perform, and require similarly useful task-performance across the full range of inputs to which the brain could be exposed, it is likely that for at least one of the tasks in the relevant profile, for some set of inputs, the brain’s method will (a) be close to maximally algorithmically efficient (e.g., within an order of magnitude or two), and (b) use a substantial portion of the computational resources that the brain has available. For example, if you take a computer from the 60s, and you look at all of the tasks it could perform, Dr. Christiano expects that many of the algorithms it was running (for example: sorting), were close to optimally efficient. As another example, there is a very inefficient algorithm for SAT solving, which takes 2n time. For many inputs, we can improve on this algorithm by a huge amount, but we can’t for every input: indeed, there is a rough consensus amongst computer scientists that the very inefficient algorithm is close to the best one can do. Indeed, Dr. Christiano expects that for most algorithms, there will be some family of instances on which it does reasonably well. And given how large the space of possible tasks the brain performs is (we can imagine a very wide set of evaluation metrics and input regimes), the density of roughly-optimal-on-some-inputs algorithms doesn’t need to be that high for them to appear in the brain” (p. 7).

80.It’s not entirely clear which concept Moravec and Kurzweil have in mind, but (1) has some support. See Moravec (1998): “How much further must this evolution proceed until our machines are powerful enough to approximate the human intellect?” (p. 52), and his reply to Anders Sandberg here: “It is the final computation that matters, not the fuss in doing it.” Kurzweil (2005): “if two methods achieve the same result but one uses more computation than the other, the more computationally intensive method will be considered to use only the amount of computation of the less intensive method” (p. 137).

81.See Sandberg and Bostrom (2008) (p. 11), for a taxonomy of possible brain-emulation success criteria. See Muehlhauser (2017) for an investigation at Open Philanthropy of consciousness and moral patienthood.

82.There is a fairly widespread discourse related to the importance of “embodiment” in AI and cognitive science more broadly, which I have not engaged with in depth. At a glance, central points seem to be: (a) that the computation a brain performs is importantly adapted to the physical environment in which it operates, and the representations it employs are constrained by the body that implements them (see e.g. Hoffmann and Pfeifer (2012), and the discussion of “Body as constraint” in Wilson and Foglia (2015)), (b) that the morphology of the body itself can contribute to control, perception, and computation proper, and that not all information-processing or storage takes place “inside the head” (Müller and Hoffmann (2017), the discussion of “Body as distributor” in Wilson and Foglia (2015), the literature on the “extended mind”), (c) that the body functions to coordinate/regulate the relationship between cognition and action (see “Body as Regulator” in Wilson and Foglia (2015)), and (d) that advanced AI systems won’t be developed until we make it possible for them to learn via engagement with real-time, complex environments, possibly via robotic bodies (see Medlock (2017); Prof. Anthony Zador also suggested something like this in conversation, see here). These points may well be true, but I do not think they disrupt the conceptual foundations of the present investigation, which aims to estimate the compute sufficient to replicate the brain’s contribution to (possibly embodied) task-performance. If points related to embodiment are thought to extend to the claim that e.g. artificial systems without bodies are incapable, in principle, of solving software problems, competing in math competitions, or reviewing science papers, then I simply disagree.

83.This literature review draws from the reviews offered by Sandberg and Bostrom (2008) (p. 84-85) and Martins (2012) (p. 3-6). I have supplemented it with other estimates I encountered in my research. In order to limit its scope, I focus on direct attempts to estimate the computation sufficient to run a task-functional model.

84.The estimates that I think most worth taking seriously are generally the ones I discuss in the report itself.

85.Merkle (1989) attempts to estimate the number of spikes through synapses per second by estimating the energy dissipated by propagating a spike a certain distance, together with the number of synapses per unit distance, rather than counting spikes and synapses directly. He gets ~2e15 synaptic operations per second, assuming 1 synapse every millimeter, though it is unclear to me what grounds his estimate of synapses per unit distance: “To translate Ranvier ops (1-millimeter jumps) into synapse operations we must know the average distance between synapses, which is not normally given in neuroscience texts. We can estimate it: a human can recognize an image in about 100 milliseconds, which can take at most 100 one-millisecond synapse delays. A single signal probably travels 100 millimeters in that time (from the eye to the back of the brain, and then some). If it passes 100 synapses in 100 millimeters then it passes one synapse every millimeter–which means one synapse operation is about one Ranvier operation” (1989).

86.Merkle (1989): “We might count the number of synapses, guess their speed of operation, and determine synapse operations per second. There are roughly 10^15 synapses operating at about 10 impulses/second, giving roughly 10^16 synapse operations per second” (see “Other Estimates”).

87.Mead (1990): “There are about 10^16 synapses in the brain. A nerve pulse arrives at each synapse about ten times/s, on average. So in rough numbers, the brain accomplishes 10^16 complex operations/s” (p. 1629). Some aspect of this estimate appears to be in error, however, as it seems to suggest the calculation 10^16 synapses × 10 spikes/sec = 10^16 spikes through synapses per second, when the left-hand side in fact multiplies out to 10^17.
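The arithmetic behind these synapse-count × firing-rate estimates is simple enough to make explicit. Here is a minimal sketch (Python), using only the numbers as quoted above; the variable names are mine:

```python
# Synapse-count x firing-rate arithmetic, using the authors' quoted numbers.
merkle_ops = 1e15 * 10   # Merkle (1989): 10^15 synapses x 10 impulses/sec
print(f"Merkle: {merkle_ops:.0e} synapse ops/sec")   # 1e16, as he reports

mead_ops = 1e16 * 10     # Mead (1990): 10^16 synapses x 10 pulses/sec
print(f"Mead: {mead_ops:.0e} ops/sec")               # 1e17, vs. the 1e16 he reports
```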

88.Freitas (1996): “A fair estimate is that the 1.5 kilogram organ has 10^10 neurons with 10^3 synapses firing an average 10 times per second, which is about 10^14 bits/second. Using 64-bit words like the largest supercomputers, that’s about one teraflop” (see opening section).

89.Sarpeshkar (1997): “From the numbers in the first paragraph of Section 5.6.1, we know that there are about 2.4 × 10^14 synapses in each cortex of the brain. The average firing rate of cortex is about 5-10 Hz – we shall use 7.5 Hz. Assuming that each synapse is always operational and constantly computing, then the number of synaptic operations per second is 2 × 2.4 × 10^14 × 7.5 = 3.6 × 10^15” (p. 202-203).

90.Bostrom (1998): “The human brain contains about 10^11 neurons. Each neuron has about 5 × 10^3 synapses, and signals are transmitted along these synapses at an average frequency of about 10^2 Hz. Each signal contains, say, 5 bits. This equals 10^17 ops” (see “Hardware Requirements” section).

91.Kurzweil (1999): “With an estimated average of one thousand connections between each neuron and its neighbors, we have about 100 trillion connections, each capable of a simultaneous calculation… With 100 trillion connections, each computing at 200 calculations per second, we get 20 million billion calculations per second. This is a conservatively high estimate; other estimates are lower by one to three orders of magnitude” (see Chapter 6, section “Achieving the Hardware Capacity of the Human Brain”).

92.Dix (2005): “At a simplified level each neuron’s level of activation is determined by pulses generated at the (1000 to 10,000) synapses connected to it. Some have a positive excitatory effect [sic] some are inhibitory. A crude model simply adds the weighted sum and ‘fires’ the neuron if the sum exceeds a value. The rate of this activity, the ‘clock period’ of the human brain is approximately 100 Hz – very slow compared to the GHz of even a home PC, but of course this happens simultaneously for all 10 billion neurons! If we think of the adding of the weighted synaptic value as a single neural operation (nuop) then each neuron has approximately 10,000 nuops per cycle, that is 1mega-nuop per second. In total the 10 billion neurons in the brain perform 10 peta-nuop per second.”

93.Malickas (2007): “The evaluation of the computational power of [sic] human brain [sic] very uncertain at this time. Some estimates of brain power could be based on the brain synapses number and neurons [sic] firing rate. The human brain have [sic] a 10^11 neurons and each neuron has [sic] average of 10^2 – 10^4 synapses. The average firing rate of brain neurons is about 100-1000 Hz. As result the brain modeling would require the computational power of 10^11 neurons × (10^2-10^4 synapses/neuron) × (100-1000 Hz) = 10^15 – 10^18 synapses/second” (see section “Computer”).

94.Tegmark (2017): “Multiplying together about 10^11 neurons, about 10^4 connections per neuron and about one (10^0) firing per neuron each second might suggest that about 10^15 FLOPS (1 petaFLOPS) suffice to simulate a human brain, but there are many poorly understood complications, including the detailed timing of firings and the question of whether small parts of neurons and synapses need to be simulated too” (see endnote 58, p. 340). That said, Tegmark presents this less as an independent estimate of his own, and more as an example of a certain methodology.

95.Sandberg and Bostrom (2008) also cite Fiala (2007) as estimating “10^14 synapses, identity coded by 48 bits plus 2 × 36 bits for pre‐and postsynaptic neuron id, 1 byte states. 10 ms update time… 256,000 terabytes/s” (p. 85), and Seitz (no date) as estimating “50-200 billion neurons, 20,000 shared synapses per neuron with 256 distinguishable levels, 40 Hz firing” (p. 85). However, I wasn’t able to find the original papers on a quick search. Adams (2013) estimates ~1e15 FLOP/s in a blog post, but his estimate of neuron count is off by two orders of magnitude.

96.I haven’t investigated comparisons between these different units and FLOP/s (though see Sandberg and Bostrom (2008), p. 91, for some discussion of the relationship between FLOP/s and MIPS).

97.As I note in Section 2.1.1.1, many of these estimates rely on average spike rates that seem to me too high.

98.Sarpeshkar (2010): “The brain’s neuronal cells output ~1ms pulses (spikes) at an average rate of 5 Hz [55]. The 240 trillion synaptic connections [1] amongst the brain’s neurons thus lead to a computational rate of at least 10^15 synaptic operations per second. A synapse implements multiplication and filtering operations on every spike and sophisticated learning operations over multiple spikes. If we assume that synaptic multiplication is at least one floating-point operation (FLOP), the 20 ms second-order filter impulse response due to each synapse is 40 FLOPS, and that synaptic learning requires at least 10 FLOPS per spike, a synapse implements at least 50 FLOPS of computation per spike. The nonlinear adaptation-and-thresholding computations in the somatic regions of a neuron implement almost 1200 floating-point operations (FLOPS) per spike [66]. Thus, the brain is performing at least 50 FLOPS × 5 Hz × 240 × 10^12 + 1200 FLOPS × 5 Hz × 22 × 10^9 ≈ 6 × 10^16 FLOPS per second” (p. 748-749).
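Sarpeshkar’s arithmetic here is easy to reproduce (a minimal sketch in Python; all numbers are his, as quoted):

```python
# Sarpeshkar (2010): FLOPs per spike at synapses and somas, times spike rates.
synapses = 240e12              # synaptic connections
neurons = 22e9                 # neurons
rate_hz = 5                    # average spike rate
flops_per_spike_synapse = 50   # multiplication + filtering + learning
flops_per_spike_soma = 1200    # adaptation-and-thresholding

total = (flops_per_spike_synapse * rate_hz * synapses
         + flops_per_spike_soma * rate_hz * neurons)
print(f"{total:.2e} FLOP/s")   # ~6.01e16, matching his ~6e16 figure
```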

99.Martins et al. (2012): “These data may be combined using Eqns. (1) and (2) to yield an estimate of the synaptic-processed spike rate of Tss = (4.31 ± 0.86) × 10^15 spikes/sec and the synaptic-processed bit rate of Tsb = (5.52 ± 1.13) × 10^16 bits/sec for the entire human brain” (p. 14).

100.Kurzweil (2005): “The ‘fan out’ (number of interneuronal connections) per neuron is estimated at 10^3. With an estimated 10^11 neurons, that’s about 10^14 connections. With a reset time of five milliseconds, that comes to about 10^16 synaptic transactions per second. Neuron-model simulations indicate the need for about 10^3 calculations per synaptic transaction to capture the nonlinearities (complex interactions) in the dendrites and other neuron regions, resulting in an overall estimate of about 10^19 cps for simulating the human brain at this level. We can therefore consider this an upper bound, but 10^14 to 10^16 cps to achieve functional equivalence of all brain regions is likely to be sufficient” (p. 124-125).
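Kurzweil’s upper-bound arithmetic, reproduced as a minimal sketch (all numbers his, as quoted; note that the exact products, 2 × 10^16 and 2 × 10^19, get rounded in his prose to ~10^16 and ~10^19):

```python
# Kurzweil (2005): connections x transaction rate x calculations per transaction.
connections = 1e11 * 1e3                    # 10^11 neurons x 10^3 fan-out = 10^14
transactions_per_sec = connections / 5e-3   # 5 ms reset time -> 2e16
upper_bound = transactions_per_sec * 1e3    # 10^3 calculations per transaction
print(f"{transactions_per_sec:.0e} transactions/sec")   # 2e16
print(f"{upper_bound:.0e} cps upper bound")              # 2e19
```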

101.Thagard (2002): “If we count the number of processors in the brain as not just the number of neurons in the brain, but the number of proteins in the brain, we get a figure of around a billion times 100 billion, or 10^17. Even if it is not legitimate to count each protein as a processor all by itself, it is still evident from the discussion in Section 3 that the number of computational elements in the brain is more than the 10^11 or 10^12 neurons. Moreover, the discussion of hormones and other neuroregulators discussed in Section 5 shows that the number of computationally relevant causal connections is far greater than the thousand or so synaptic connections per neuron. I do not know how to estimate the number of neurons with hormonal receptors that can be influenced by a single neuron that secretes hormones or that activates glands which secrete hormones, but the number must be huge. If it is a million, and if every brain protein is viewed as a mini-processor, then the computational speed of the brain is on the order of 10^23 calculations per second, far larger than the 10^15 calculations per second that Kurzweil expects to be available by 2020, although less than where he expects computers to be by 2060. Thus quantitatively it appears that digital computers are much farther away than Kurzweil and Moravec estimate from reaching the raw computational power of the human brain” (see Section 7, “Artificial Intelligence”).

102.Tuszynski (2006): “There are four c-termini states per dimer because we have two states per monomer. There could be at least four states per electron inside the tubulin dimer, as they hop between two locations. There could be at least two computational changes due to the GTP hydrolysis. Thus there are 4 × 4 × 2, which is 32 states per dimer; thirteen dimers per ring; and 1,250 rings per midsize microtubule. If you do the math, the result is about 100 kilobytes per microtubule. Calculating the number of microtubules per neuron, you get one gigabyte of processing power per neuron. There are ten billion neurons. You have ten to the 19th bytes per brain and they oscillate or make transitions in this state on the order of nanoseconds, and ten to the 28th flops per brain” (p. 4-5 on the website).

103.von Neumann (1958): “Thus the standard receptor would seem to accept about 14 distinct digital impressions per second, which can probably be reckoned as the same number of bits. Allowing 10^10 nerve cells, assuming that each one of them is under suitable conditions essentially an (inner or outer) receptor, a total input of 14 × 10^10 bits per second results” (p. 63).

104.Dettmers (2015): “So my estimate would be 1.075×10^21 FLOPS for the brain, the fastest computer on earth as of July 2013 has 0.58×10^15 FLOPS for practical application (more about this below)” (see section “estimation of cerebellar input/output dimensions”).

105.See Ananthanarayanan et al. (2009), Figure 8 (p. 10). Greenemeier (2009) cites IBM’s Dharmendra Modha (one of the authors on the paper) as estimating that a computer comparable to the human brain would need to perform 4e16 operations per second, but I’m not sure of his methodology.

106.Waldrop (2012): “The computer power required to run such a grand unified theory of the brain would be roughly an exaflop, or 10^18 operations per second — hopeless in the 1990s. But Markram was undaunted: available computer power doubles roughly every 18 months, which meant that exascale computers could be available by the 2020s (see ‘Far to go’). And in the meantime, he argued, neuroscientists ought to be getting ready for them” (see section “Markram’s big idea”). See also this chart.

107.He also discusses a possible lower estimate around 19:43, but the video is too blurry for me to read the numbers.

108.See here. See also Izhikevich and Edelman (2007).

109.See Sandberg and Bostrom (2008) (p. 80-81). My impression is that these estimates were very rough, and their 1e18 estimate for a spiking neural network seems inconsistent with the estimate methodology they use elsewhere in the chart, since 1e15 entities × 10 FLOPs per entity × 1e3 time-steps per second = 1e19 FLOP/s.

110.Strong selection effects were likely at work in determining who was present at the workshop.

111.See Moravec (1988), Chapter 2 (p. 51-74). See also Moravec (1998) and Moravec (2008). I discuss this estimate in detail in Section 3.1.

112.Kurzweil (2005) also cites Zaghloul and Boahen (2006) as an example of replicating retinal functionality, but does not attempt a quantitative estimate using it (endnote 41, p. 532).

113.Kurzweil (2005): “Another estimate comes from the work of Lloyd Watts and his colleagues on creating functional simulations of regions of the human auditory system, which I discuss further in chapter 4… Watts’s own group has created functionally equivalent re-creations of these brain regions derived from reverse engineering. He estimates that 10^11 cps are required to achieve human-level localization of sounds. The auditory cortex regions responsible for this processing comprise at least 0.1 percent of the brain’s neurons. So we again arrive at a ballpark estimate of around 10^14 cps (10^11 cps × 10^3)” (p. 123).

114.Kurzweil (2005): “Yet another estimate comes from a simulation at the University of Texas that represents the functionality of a cerebellum region containing 10^4 neurons; this required about 10^8 cps, or about 10^4 cps per neuron. Extrapolating this over an estimated 10^11 neurons results in a figure of about 10^15 cps for the entire brain” (p. 123).

115.Kurzweil (2012): “emulating one cycle in a single pattern recognizer in the biological brain’s neocortex would require about 3,000 calculations. Most simulations run at a fraction of this estimate. With the brain running at about 10^2 (100) cycles per second, that comes to 3 × 10^5 (300,000) calculations per second per pattern recognizer. Using my estimate of 3 × 10^8 (300 million) pattern recognizers, we get about 10^14 (100 trillion) calculations per second” (p. 195).
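The three functional extrapolations in these notes (sound localization, the cerebellum simulation, and the pattern recognizers) all follow the same scale-up arithmetic; a minimal sketch, using Kurzweil’s quoted numbers:

```python
# Kurzweil's functional-method extrapolations, as quoted in the notes above.
print(f"auditory: {1e11 / 1e-3:.0e} cps")          # 1e11 cps / 0.1% of neurons -> 1e14
print(f"cerebellum: {1e8 / 1e4 * 1e11:.0e} cps")   # 1e4 cps/neuron x 1e11 neurons -> 1e15
print(f"neocortex: {3e3 * 1e2 * 3e8:.0e} calc/s")  # 3000 x 100 Hz x 3e8 recognizers -> 9e13, ~1e14
```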

116.Drexler (2019): “In light of the above comparisons, all of which yield values of R_PFLOP in the 10 to 1000 range, it seems likely that 1 PFLOP/s machines equal or exceed the human brain in raw computation capacity. To draw the opposite conclusion would require that the equivalents of a wide range of seemingly substantial perceptual and cognitive tasks would consistently require no more than an implausibly small fraction of total neural activity” (p. 188).

117.Sandberg (2016): “20 W divided by 1.3 × 10^-21 J (the Landauer limit at body temperature) suggests a limit of no more than 1.6 × 10^22 irreversible operations per second” (p. 5).

118.De Castro (2013): “If system 1 is considered to be a powerful computer operating at maximum Landauer efficiency—i.e., at a minimum energy cost equal to k_B T ln(2)—that works at an average brain temperature, the number of perceptual operations per second that it could perform is on the order of 10^23 (1/k_B), depending on the idiosyncratic power of the brain” (p. 483).
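The arithmetic in these two limit-method notes is also easy to reproduce. In the sketch below, k_B T ln 2 at 310 K comes out to ~3.0 × 10^-21 J, a small constant factor above Sandberg’s quoted 1.3 × 10^-21 J, so the bottom lines agree on order of magnitude:

```python
import math

k_B = 1.381e-23                     # Boltzmann constant, J/K
T = 310                             # approximate body temperature, K
landauer_j = k_B * T * math.log(2)  # minimum energy per irreversible bit erasure
brain_watts = 20

print(f"kT ln 2 at 310 K: {landauer_j:.1e} J")              # ~3.0e-21 J
print(f"20 W / (kT ln 2): {brain_watts / landauer_j:.1e}")  # ~6.7e21 ops/sec
print(f"20 W / 1.3e-21 J: {brain_watts / 1.3e-21:.1e}")     # ~1.5e22, Sandberg's figure
print(f"1 / k_B: {1 / k_B:.1e}")                            # ~7.2e22, De Castro's 'order 10^23'
```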

119.Though there is some discussion of it on Metaculus.

120.For example, Laughlin et al. (1998) estimate that “synapses and cells are using 10^5 to 10^8 times more energy than the thermodynamic minimum” (the minimum they have in mind is on the order of a kT per bit “observed”); and Levy et al. (2014) argue that once the costs of communication and computation in the brain are adequately distinguished, it is possible to identify places in which the energy efficiency of neural computation approaches the minimum set by Landauer. For more on the energy efficiency of neural computation, see also Laughlin (2001), Attwell and Laughlin (2001), Balasubramanian et al. (2001), Hasenstaub et al. (2010), Levy and Baxter (1996), Skora et al. (2017), Levy and Baxter (2002), Balasubramanian and Berry (2002), Niven et al. (2007), Lennie (2003), Howarth et al. (2010), and Sarpeshkar (2010), Chapter 23. For discussions of thermodynamics in the brain in particular, see Collell and Fauquet (2015), Varpula (2013), Deli et al. (2017), and Street (2016). Work on the “free energy principle” (see e.g. Friston (2010)) in the context of the brain also has connections to thermodynamics. In a not-specifically-neural context, Kempes et al. (2017) argue: “Here we show that the computational efficiency of translation, defined as free energy expended per amino acid operation, outperforms the best supercomputers by several orders of magnitude, and is only about an order of magnitude worse than the Landauer bound” (p. 1); and Wolpert (2016) attempts to extend a version of Landauer’s reasoning to derive the minimal free energy required by an organism to run a stochastic map from sensor inputs to actuator outputs. See also Ouldridge and ten Wolde (2017), Ouldridge (2017), Sartori et al. (2014), Mehta and Schwab (2012), and Mehta et al. (2016).

121.AI Impacts: “Among a small number of computers we compared, FLOPS and TEPS seem to vary proportionally, at a rate of around 1.7 GTEPS/TFLOP. We also estimate that the human brain performs around 0.18 – 6.4 × 10^14 TEPS. Thus if the FLOPS:TEPS ratio in brains is similar to that in computers, a brain would perform around 0.9 – 33.7 × 10^16 FLOPS. We have not investigated how similar this ratio is likely to be.” (See section “Conversion from brain performance in TEPS”).
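The conversion they describe is a single ratio; a minimal sketch (the small differences from their quoted 0.9 – 33.7 × 10^16 range presumably reflect their use of a range of ratios rather than the single 1.7 figure):

```python
# AI Impacts' TEPS -> FLOP/s conversion, using the 1.7 GTEPS/TFLOP ratio.
gteps_per_tflop = 1.7
for teps in (0.18e14, 6.4e14):      # their estimated range for the brain
    flops = teps / (gteps_per_tflop * 1e9) * 1e12
    print(f"{teps:.2e} TEPS -> {flops:.1e} FLOP/s")
# prints ~1.1e16 and ~3.8e17 FLOP/s
```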

122.See e.g. the rough estimates from Sandberg and Bostrom (2008) (p. 80-81), to the effect that emulating the states of the protein complexes in the brain would require 1e27 FLOP/s, and that emulating the stochastic behavior of single molecules in the brain would require 1e43 FLOP/s. Henry Markram, in a 2018 video (18:28), estimates the FLOP/s burdens of running a “real-time molecular simulation of the human brain” at 4e29 FLOP/s. Today’s top supercomputers can do roughly 1e17 FLOP/s. Mike Frank projects that 1e21 FLOP/s would require more than a gigawatt of power in 2030 – comparable to the power generated by the Hoover Dam – and his chart suggests that physical limits would begin to cause serious problems for performing many orders of magnitude more than that on currently-reasonable amounts of power.

123.I first encountered the idea that the computational relevance of processes within the neuron is bottlenecked by intercellular signaling via one of our technical advisors, Dr. Dario Amodei. See also Open Philanthropy’s non-verbatim notes from a conversation with Prof. Dong Song: “Prof. Song thinks that everyone should agree that neurons are the fundamental computational unit of the brain. If you can replicate all the neuron activity, you’ll probably be able to replicate brain function. Neurons communicate with each other via spikes. Variables internal to a neuron are important to determining the neuron’s spiking behavior in response to inputs, but the other neurons do not know or care about these internal variables. So as long as you can replicate the input-output mapping at the level of spiking, you are basically replicating the relevant function of a single neuron. So if you have a good spiking neuron model, and you connect your neurons correctly, you should be able to replicate brain function” (p. 2). Robin Hanson gestures at a similar idea in the beginning of his 2017 TED talk. My general impression was that almost all of the neuroscientists I spoke to took something like this kind of paradigm for granted.

124.“Standard” here indicates “the type of neuron signaling people tend to focus on.” Whether it is the signaling method that the brain relies on most heavily is a more substantive question.

125.In particular, the categories plausibly overlap: much of the standard neuron signaling in the brain may be in the service of what would generally be folk-theoretically understood as “learning” (see Open Philanthropy’s non-verbatim notes from a conversation with Dr. Adam Marblestone: “it might be that all of the neurons and synapses in the brain are there in order to make the brain more likely to converge on a solution while learning” (p. 7)); various alternative signaling mechanisms (for example, neuromodulation, and signaling in certain types of glial cells) may themselves be central to learning as well.

126.Azevedo et al. (2009): “We find that the adult male human brain contains on average 86.1 ± 8.1 billion NeuN-positive cells (“neurons”) and 84.6 ± 9.8 billion NeuN-negative (“nonneuronal”) cells” (p. 532). My understanding is that the best available method of counting neurons is isotropic fractionation, which proceeds by dissolving brain structures into a kind of homogenous “brain soup,” and then counting cell nuclei (see Herculano-Houzel and Lent (2005) for a more technical description of the process, and Bartheld et al. (2016) for a history of cell-counting in the brain). Note that there may be substantial variation in cell counts between individuals (for example, according to Bartheld et al. (2016) (p. 9), citing Haug (1986) and Pakkenberg and Gundersen (1997), neocortical neuron count may vary by a factor of more than two, though I haven’t checked these further citations).

127.See e.g. Pakkenberg et al. (2002): “Synapses have a diameter of 200–500 nm and can only be seen by electron microscopy. The primary problem in assessing the number of synapses in human brains is their lack of resistance to the decay starting shortly after death” (p. 98).

128.Kandel et al. (2013): “An average neuron forms and receives 1,000 to 10,000 synaptic connections. Thus 10^14 to 10^15 synaptic connections are formed in the brain” (p. 175). Henry Markram uses 1e15 total synapses in this video (18:31); AI Impacts suggests 1.8-3.2e14. A number of synapse estimates focus on the cerebral cortex, and in particular on the neocortex (the cerebral cortex is divided into two parts, the neocortex and the allocortex, but Swenson (2006) suggests that “most of the cerebral cortex is neocortex”). For example: Tang et al. (2001) write that “The average total number of synapses in the neocortex of five young male brains was 164 × 10^12 (CV = 0.17)” (p. 258); Pakkenberg et al. (2003): “The total number of synapses in the human neocortex is approximately 0.15 × 10^15 (0.15 quadrillion) … On average, the neocortical neurons thus have about 7000 synapses each for intracortical reception and exchange of information” (p. 95 and 98); Zador (1999) writes that “A pyramidal neuron in the cortex receives excitatory synaptic input from 1e3 to 1e4 other neurons” (p. 1219) (he cites Shepherd (1990) for this number, though I haven’t followed up on the citation); Ananthanarayanan et al. (2009): “Cognition and computation arise from the cerebral cortex; a truly complex system that contains roughly 20 billion neurons and 200 trillion synapses” (Section 6). AI Impacts’ impression is that this focus on the neocortex derives “from the assumption that the neocortex contains the great bulk of synapses in the brain” – an impression that I share. They suggest that this assumption may derive in part from the fact that the neocortex represents the bulk of the brain’s volume. The cerebral cortex contains a minority of the brain’s neurons (about 19%, according to Azevedo et al. (2009) (p. 536)), but almost all of the rest reside in the cerebellum, and about 50 billion of those are non-neocortical cerebellar granule cells (at least according to Llinás et al. (2004) (p. 277)), which appear to have a comparatively small number of synapses each: “[Granule] cells are the most numerous in the CNS; there are about 5 × 10^10 cerebellar granule cells in the human brain. Each cell has four or five short dendrites (each less than 30 μm long) that end in an expansion called a dendritic claw (see fig. 7.4C in chapter 7).” Wikipedia cites Llinás et al. (2004) as grounds for attributing 80-100 synaptic connections to granule cells, but I haven’t been able to find the relevant number. The cerebellum also contains Purkinje cells (up to 1.5e7, according to Llinás et al. (2004) (p. 276)), which can have over 100,000 synapses each, though I’m not sure of the average number (see Napper and Harvey (1988): “We conclude that there are some 175,000 parallel fiber synapses on an individual Purkinje cell dendritic tree in the cerebellar cortex of the rat” (abstract), though this is an old estimate). I have not attempted to estimate the synapses in the cerebellum in particular, and I am not sure to what extent synapse counts for granule cells and Purkinje cells overlap (a possibility that could lead to double counting). AI Impacts, on the basis of energy consumption and volume estimates for the neocortex, guesses that the number of synapses in the entire brain is “somewhere between 1.3 and 2.3 times the number in the cerebral cortex.”
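As a rough consistency check on the numbers collected in this note (a minimal sketch; the pairing of Pakkenberg’s synapse count with Ananthanarayanan’s neuron count is mine):

```python
# Cross-checking the quoted cortical numbers.
neocortex_synapses = 0.15e15    # Pakkenberg et al. (2003)
cortex_neurons = 20e9           # Ananthanarayanan et al. (2009)
print(f"synapses per neuron: {neocortex_synapses / cortex_neurons:.0f}")  # 7500, ~their 7000

# AI Impacts' whole-brain guess: 1.3-2.3x the cerebral cortex count.
for factor in (1.3, 2.3):
    print(f"whole brain (x{factor}): {neocortex_synapses * factor:.1e}")
# ~2e14 to ~3.5e14, in line with the 1.8-3.2e14 range mentioned above
```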

129.Wang et al. (2016): “By recording in human, monkey, and mouse neocortical slices, we revealed that FS neurons in human association cortices (mostly temporal) could generate APs at a maximal mean frequency (Fmean) of 338 Hz and a maximal instantaneous frequency (Finst) of 453 Hz, and they increase with age” (p. 1). Marblestone et al. (2013): “certain neurons spike at 500 Hz or faster (Gittis et al. (2010))” (section 2.2).

130.Barth and Poulet (2012) (p. 4-5) list a large number of firing rates observed in rat neurons, almost all of which appear to be below 10 Hz. Buzsáki and Mizuseki (2014): “Recent quantifications of firing patterns of cortical pyramidal neurons in the intact brain have shown that the mean spontaneous and evoked firing rates of individual neurons span at least four orders of magnitude and that the distribution of both stimulus-evoked and spontaneous activity in cortical neurons obeys a long-tailed, typically lognormal, pattern” (p. 266). I have not attempted to calculate mean rates using the numbers in Buzsáki and Mizuseki (2014). See also the studies cited by AI Impacts in the section titled “estimates of the rate of firing in non-human visual cortex.”

131.Anthony Zador used an average rate of 1 Hz (see Open Philanthropy’s non-verbatim notes from a conversation with Prof. Anthony Zador, p. 4). Konrad Kording suggested that neurons run at roughly 10 Hz (see Open Philanthropy’s non-verbatim notes from a conversation with Prof. Konrad Kording). Sarpeshkar (citing Attwell and Laughlin (2001)) uses 5 Hz. Ananthanarayanan et al. (2009) suggest that the average neural firing rate is “typically at least 1 Hz” (section 3.1.2).

132.See p. 494-495.

133.P. 495.

134.Barth and Poulet (2012): “accumulating experimental evidence, using non-selective methods to assess the activity of identified, individual neurons, indicates that traditional extracellular recordings may have been strongly biased by selection of the most active cells” (p. 1). Buzsáki and Mizuseki (2014): “Each recording technique has some caveat. For example, patch-clamping of neurons may affect the firing patterns of neurons. Cell-attached methods are less invasive, but here the identity of the recorded cell often remains unknown and one might argue that the skewed distribution simply reflects the recording of large numbers of slow-firing pyramidal cells and a smaller number of faster-discharging interneurons. Furthermore, long-term recordings are technically difficult to obtain, and this may result in biased sampling of more-active neurons. Extracellular recording of spikes with sharp metal electrodes typically offers reliable single neuron isolation; however, as in cell-attached recordings, sampling of single neurons is often biased towards selecting fast-firing cells because neurons with low firing rates are often not detected during short recording sessions. Moreover, in many cases, only evoked firing patterns in very short time windows are examined. Chronic recordings with tetrodes and silicon probes can reduce such bias towards cells with a high firing rate, as the electrodes are moved infrequently and large numbers of neurons can be monitored from hours to days. In addition, one can separate the recorded population into excitatory and inhibitory neuron types in vivo through physiological characterization or by using optogenetic methods. Caveats of the extracellular probe methods include the lack of objective quantification of spike contamination and omission, the difficulty in isolating exceedingly slow-firing neurons and the lack of objective segregation of different neuron types. The left tail of the firing-rate distribution can especially vary across studies because neurons with low firing rates are often not detected during short recording sessions or because an arbitrary cut-off rate eliminates slow-firing cells. The differences in the right tail of the distribution across studies and species are probably the result of inadequate segregation of principal cells and interneurons” (p. 276).

135.Shoham et al. (2005): “To summarize, the existence of large populations of silent neurons has been suggested recently by experimental evidence from diverse systems. Only some regions and neuron types show this phenomenon: as counterexamples, interneurons and cerebellar Purkinje cells are active most or all of the time. Nonetheless, the diversity of cases in which many neurons appear to be silent includes major neuron types in the mammalian neocortex and hippocampus, the cerebellum, and the zebra finch song system. Silent neurons may be a recurring principle of brain organization” (see Conclusion, p. 6). They also suggest that their estimate of the “recordable radius” around an electrode suggests “a silent fraction of at least 90%” of neurons in the cat primary visual cortex (see Conclusion, p. 6).

136.It’s also possible that the metabolic considerations could be used as evidence for the combinations of synapse count and average spiking rate that would be compatible with the brain’s energy budget. For example, it’s possible that 10,000 synapses per neuron is incompatible with higher average spiking rates. However, I have not investigated this. Thanks to Carl Shulman for suggesting this possibility.

137.Examples include: Bostrom (1998): “signals are transmitted along these synapses at an average frequency of about 10^2 Hz” (“Hardware requirements”); Mead (1990): “A nerve pulse arrives at each synapse about ten times/s, on average” (p. 1629); Merkle (1989): “There are roughly 10^15 synapses operating at about 10 impulses/second”; Dix (2005): “The rate of this activity, the ‘clock period’ of the human brain is approximately 100 Hz”; Kurzweil (1999): “With 100 trillion connections, each computing at 200 calculations per second, we get 20 million billion calculations per second” (Chapter 6, “Achieving the Hardware Capacity of the Human Brain”).

138.This model of synaptic transmission was suggested by our technical advisor, Dr. Dario Amodei. See also Open Philanthropy’s non-verbatim notes from a conversation with Prof. Shaul Druckmann: “Setting aside plasticity, most people assume that modeling the immediate impact of a pre-synaptic spike on the post-synaptic neuron is fairly simple. Specifically, you can use a single synaptic weight, which reflects the size of the impact of a spike through that synapse on the post-synaptic membrane potential.”

139.The bullet points below were inspired by comments from Dr. Dario Amodei as well.

140.See Matt Botvinick’s comments on this podcast: “The activity of units in a deep learning system is broadly analogous to the spike rate of a neuron” (see 57.20 here).

141.Precision, here, refers to the number of bits used to represent the floating point numbers in question.

142.Koch (1999): “It is doubtful whether the effective resolution, that is, the ratio of minimal change in any one variable, such as V_m or [Ca^2+]_i, relative to the noise amplitude associated with this variable, exceeds a factor of 100. Functionally, this corresponds to between 6 and 7 bits of resolution, a puny number compared to a standard 32-bit machine architecture” (p. 471).

143.See Bartol et al. (2015) (abstract): “Signal detection theory holds that at a Signal-to-Noise Ratio (SNR) of 1, a common detection threshold used in psychophysical experiments, an ideal observer can correctly detect whether a signal is higher or lower than some threshold 69% of the time (Green and Swets (1966); Schultz (2007)). Put another way, if random samples are drawn from two Gaussian distributions whose areas overlap by 31%, an ideal observer will correctly assign a given sample to the correct distribution 69% of the time. Using this logic, we found that ~26 different mean synaptic strengths could span the entire range, assuming CV = 0.083 for each strength level, and a 69% discrimination threshold (Figure 8, see Materials and methods)” (this quote is from the “Results” section of the paper). The “e-life digest” for the paper also suggests that previous estimates were lower than this: “This estimate is markedly higher than previous suggestions. It implies that the total memory capacity of the brain – with its many trillions of synapses – may have been underestimated by an order of magnitude. Additional measurements in the same and other brain regions are needed to confirm this possibility” (see “e-life digest”).
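For reference, the bit-capacity implied by ~26 distinguishable strengths is a one-line computation (a minimal sketch):

```python
import math
print(f"{math.log2(26):.1f} bits per synapse")  # ~4.7 bits, vs. 1 bit for two states
```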

144.Sandberg and Bostrom (2008): “Assumption on the order of one bit of information per synapse has some support on theoretical grounds. Models of associative neural networks have an information storage capacity slightly under 1 bit per synapse depending on what kind of information is encoded (Nadal (1991); Nadal and Toulouse (1990)). Extending the dynamics of synapses for storing sequence data does not increase this capacity (Rehn and Lansner (2004)). Geometrical and combinatorial considerations suggest 3‐5 bits per synapse (Stepanyants, Hof et al. (2002); Kalisman, Silberberg et al. (2005)). Fitting theoretical models to Purkinje cells suggests that they can reach 0.25 bits/synapse (Brunel, Hakim et al. (2004))” (p. 84).

145.Zador (2019): “a few extra bits/synapse would be required to specify graded synaptic strengths. But because of synaptic noise and for other reasons, synaptic strength may not be specified very precisely” (p. 5).

146.Lahiri and Ganguli (2013): “recent experimental work has shown that many synapses are more digital than analog; they cannot robustly assume an infinite continuum of analog values, but rather can only take on a finite number of distinguishable strengths, a number that can be as small as two [4–6] (though see [7])”.

147.Enoki et al. (2009): “The results demonstrate that individual Schaffer collateral synapses on CA1 pyramidal neurons behave in an incremental rather than binary fashion, sustaining graded and bidirectional long-term plasticity” (“summary”).

148.Siegelbaum et al. (2013c): “The mean probability of transmitter release from a single active zone also varies widely among different presynaptic terminals, from less than 0.1 (that is, a 10% chance that a presynaptic action potential will trigger release of a vesicle) to greater than 0.9” … “Thus central neurons vary widely in the efficacy and reliability of synaptic transmission. Synaptic reliability is defined as the probability that an action potential in a pre-synaptic cell leads to some measurable response in the post-synaptic cell – that is, the probability that a presynaptic action potential releases one or more quanta of transmitter. Efficacy refers to the mean amplitude of the synaptic response, which depends on both the reliability of synaptic transmission and on the mean size of the response when synaptic transmission does occur” (p. 271). Koch (1999): “We have seen that single synapses in the mammalian cortex appear to be unreliable: release at single sites can occur as infrequently as one out of every 10 times (or even less) that an action potential invades the presynaptic terminal (Fig. 4.3)” (p. 327).

149.See e.g. McDonnell and Ward (2011), Jonas (2014, unpublished), and Faisal et al. (2008) (p. 3) for discussion of the benefits of noise.

150.As Siegelbaum et al. (2013c) note, “in synaptic connections where a low probability of release is deleterious for function, this limitation is overcome by simply having many active zones [that is, neurotransmitter release sites] in one synapse” (p. 271). The fact that the brain can choose to have reliable synapses if necessary leads Koch (1999) to suggest that there may be some “computational advantage to having unreliable synapses” – for example, increasing the number of distinguishable states a synapse can be in (p. 327).

151.From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Paul Christiano: “One way of modeling synaptic stochasticity is by assigning a fixed release probability to each synaptic vesicle, conditional on presynaptic activity. Dr. Christiano does not think that modeling spikes through synapses in this way would constitute a significant increase in required compute, relative to modeling each spike through synapse deterministically. Sampling from a normal distribution is cheap unless you need a lot of precision, and even then, Dr. Christiano believes that the cost is just linear in the number of bits of precision that you want. At 8 bits of precision and 10 vesicles, he expects that it would be possible to perform the relevant sampling with about the same amount of energy as a FLOP” (p. 5).

152.See the Siegelbaum et al. (2013c) quotes above. From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Erik De Schutter: “Some hypothesize that it’s about energy efficiency, but there is no proof of this” (p. 3).

153.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Erik De Schutter: “[synaptic stochasticity] is almost never included in neural network models” (p. 3). From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Chris Eliasmith: ‘Pretty much everything Prof. Eliasmith does with his models works fine in a stochastic regime, but stochastic approaches require more synapses, so he does not bother with them. This decision is driven primarily by the availability of deterministic large-scale computational platforms. If there were cheap stochastic computers available, Prof. Eliasmith would probably use stochastic approaches” (p. 3).

154.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Erik De Schutter: “It’s an open question whether you could capture this stochasticity by drawing from a relatively simple distribution, or whether the brain manipulates synaptic stochasticity in more computationally complex ways” (p. 3).

155.This change can be modeled in different ways (for example, as an exponential decay, or as a difference of exponentials), and different post-synaptic receptors exhibit different behaviors. See Dayan and Abbott (2001) (p. 182), Figure 5.15, and the pictures of different models here.
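For concreteness, here is a minimal sketch of the two kernel shapes mentioned above (the time constants are illustrative, not values from the report):

```python
import math

def exp_decay(t, tau=0.005):
    """Single-exponential conductance kernel (instant rise, then decay)."""
    return math.exp(-t / tau) if t >= 0 else 0.0

def diff_of_exps(t, tau_rise=0.001, tau_decay=0.005):
    """Difference-of-exponentials kernel (finite rise time, then decay)."""
    if t < 0:
        return 0.0
    return math.exp(-t / tau_decay) - math.exp(-t / tau_rise)

for ms in (0, 1, 2, 5, 10, 20):
    t = ms / 1000
    print(f"{ms:2d} ms: exp_decay={exp_decay(t):.3f}  diff_of_exps={diff_of_exps(t):.3f}")
```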

156.Sarpeshkar (2010): “Synapses are effectively spike-dependent electrochemical gm generators [my understanding is that “gm” stands for conductance]. They convert the input digital spike impulse arriving from a presynaptic transmitting neuronal axon into an exponential analog impulse-response current on the receiving dendrite of the postsynaptic neuron” (p. 739).

157.Sarpeshkar (2010): “A synapse implements multiplication and filtering operations on every spike and sophisticated learning operations over multiple spikes. If we assume that synaptic multiplication is at least one floating-point operation (FLOP), the 20 ms second-order filter impulse response due to each synapse is 40 FLOPS, and that synaptic learning requires at least 10 FLOPS per spike, a synapse implements at least 50 FLOPS of computation per spike” (p. 748-749).

158.I’m partly influenced here by comments from Dr. Adam Marblestone, see Open Philanthropy’s non-verbatim notes from a conversation with Dr. Adam Marblestone: “If you neglect this temporal shape, you’ll get the wrong output: it matters that incoming spikes coincide and add up properly” (p. 3).

159.See Open Philanthropy’s non-verbatim notes from a conversation with Prof. Shaul Druckmann: “the long time-constant of NMDA receptors increases the complexity of the neuron’s input-output transformation” (p. 3). Beniaguev et al. (2020): “Detailed studies of synaptic integration in dendrites of cortical pyramidal neurons suggested the primary role of the voltage-dependent current through synaptic NMDA receptors, including at the subthreshold and suprathreshold (the NMDA-spike) regimes (Polsky, Mel, and Schiller (2004); Branco, Clark, and Häusser (2010)). As NMDA receptors depend nonlinearly on voltage it is highly sensitive not only to the activity of the synapse in which the receptors are located but also to the activity of (and the voltage generated by) neighboring synapses and to their dendritic location. Moreover, the NMDA-current has slow dynamics, promoting integration over a time window of tens of milliseconds (Major, Larkum, and Schiller (2013); Doron et al. (2017))” (p. 8).

160.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Anthony Zador: “He does not think that … we need to include the details of synaptic conductances in our models” (p. 1). From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Adam Marblestone: “Dr. Marblestone is not sure that you need the exact shape [of the synaptic conductance], or that it needs to be re-computed every time. Specialized hardware could also be helpful (though one can say this for everything). Overall, Dr. Marblestone expects it to be possible to either leave out or simplify this computation” (p. 3).

161.My discussion of this assumption is inspired by some comments from Dr. Dario Amodei.

162.See, for example, the recent Cerebras whitepaper: “Multiplying by zero is a waste—a waste of silicon, power, and time, all while creating no new information. In deep learning, the data are often very sparse. Half to nearly all the elements in the vectors and matrices that are to be multiplied together are zeros. The source of the zeros are fundamental deep learning operations, such as the rectified linear unit nonlinearity (ReLU) and dropout, both of which introduce zeros into neural network tensors…when the data is 50 to 98% zeros, as it often is in neural networks, then 50 to 98% of your multiplications are wasted. Because the Cerebras SLA core was designed specifically for the sparse linear algebra of neural networks, it never multiplies by zero. To take advantage of this sparsity, the core has built-in, fine-grained dataflow scheduling, so compute is triggered by the data. The scheduling operates at the granularity of a single data value so only non-zero data triggers compute. All zeros are filtered out and can be skipped in the hardware. In other words, the SLA core never multiplies by zero and never propagates a zero across the fabric” (p. 5).

163.Ananthanarayanan et al. (2009): “The basic algorithm of our cortical simulator C2 [2] is that neurons are simulated in a clock-driven fashion whereas synapses are simulated in an event-driven fashion. For every neuron, at every simulation time step (say 1 ms), we update the state of each neuron, and if the neuron fires, generate an event for each synapse that the neuron is post-synaptic to and presynaptic to. For every synapse, when it receives a pre- or post-synaptic event, we update its state and, if necessary, the state of the post-synaptic neuron” (p. 3).
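A toy sketch of this division of labor, with neuron state updated every time step (clock-driven) and synapses touched only when a spike occurs (event-driven); the structure and parameters here are illustrative, not the C2 code:

```python
import random

class Neuron:
    def __init__(self):
        self.v = 0.0
        self.fired = False

    def update(self, drive):
        # Clock-driven: leaky accumulation plus input drive, fire on threshold.
        self.v = 0.9 * self.v + drive
        self.fired = self.v > 1.0
        if self.fired:
            self.v = 0.0

def simulate(n_neurons=100, fan_out=10, steps=50):
    neurons = [Neuron() for _ in range(n_neurons)]
    # synapses[i]: list of (target index, weight) pairs for neuron i
    synapses = [[(random.randrange(n_neurons), 0.1) for _ in range(fan_out)]
                for _ in range(n_neurons)]
    inputs = [0.0] * n_neurons
    spikes = 0
    for _ in range(steps):                      # one iteration per ~1 ms time step
        drives, inputs = inputs, [0.0] * n_neurons
        for i, neuron in enumerate(neurons):
            neuron.update(drives[i] + random.uniform(0.0, 0.3))
            if neuron.fired:                    # event-driven: synapse work only on spikes
                spikes += 1
                for target, weight in synapses[i]:
                    inputs[target] += weight
    print(f"{spikes} spikes over {steps} steps")

simulate()
```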

164.See e.g. Sandberg and Bostrom (2008) (p. 80-81); and Henry Markram, in a 2018 video (18:28).

165.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Blake Richards: “Some neuroscientists are interested in the possibility that a lot of computation is occurring via molecular processes in the brain. For example, very complex interactions could be occurring in a structure known as the post-synaptic density, which involves molecular machinery that could in principle implicate many orders of magnitude of additional compute per synapse. We don’t yet know what this molecular machinery is doing, because we aren’t yet able to track the states of the synapses and molecules with adequate precision. There is evidence that perturbing the molecular processes within the synapse alters the dynamics of synaptic plasticity, but this doesn’t necessarily provide much evidence about whether these processes are playing a computational role. For example, their primary role might just be to maintain and control a single synaptic weight, which is itself a substantive task for a biological system” (p. 2). See also Bhalla (2014): “Neurons perform far more computations than the conventional framework of summation and propagation of electrical signals from dendrite to soma to axon. There is an enormous and largely hidden layer of molecular computation, and many aspects of neuronal plasticity have been modeled in chemical terms. Memorable events impinge on a neuron as special input patterns, and the neuron has to decide if it should ‘remember’ this event. This pattern-decoding decision is mediated by kinase cascades and signaling networks over millisecond to hour-long timescales. The process of cellular memory itself is rooted in molecular changes that give rise to life-long, stable physiological changes. Modeling studies show how cascades of synaptic molecular switches can achieve this, despite stochasticity and molecular turnover. Such biochemically detailed models form a valuable conceptual framework to assimilate the complexities of chemical signaling in neuronal computation” (abstract).

166.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Barak Pearlmutter: “Prof. Pearlmutter thought that the compute for firing decisions would be “in the noise” relative to compute for spikes through synapses, because there are so many fewer neurons than synapses” (p. 2). And from Open Philanthropy’s non-verbatim notes from a conversation with Prof. Anthony Zador: “There is a big difference, computationally, between processes that happen at every synapse, and processes that only happen at the soma, because there are orders of magnitude fewer somas than synapses” (p. 2).

167.See Fig. 1 (p. 80).

168.See figure 2.

169.See figure 2. Integrate-and-fire models are roughly 5-15 FLOPs per ms; Hodgkin-Huxley is ~1200.
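To make the per-millisecond FLOP counts concrete, here is a minimal leaky integrate-and-fire update, which costs a handful of FLOPs per 1 ms step (the constants are illustrative):

```python
def lif_step(v, i_in, dt=1.0, tau=20.0, v_rest=-65.0, v_thresh=-50.0, v_reset=-65.0):
    """One 1-ms leaky integrate-and-fire update: ~5 FLOPs, vs. ~1200 for Hodgkin-Huxley."""
    v = v + (-(v - v_rest) + i_in) * (dt / tau)   # leak toward rest, plus input current
    if v >= v_thresh:                             # threshold check
        return v_reset, True                      # spike and reset
    return v, False

v = -65.0
for t in range(100):
    v, spiked = lif_step(v, i_in=20.0)
    if spiked:
        print(f"spike at t = {t} ms")
```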

170.One expert I spoke to said this, though the comment didn’t end up in the conversation notes.

171.See Fig. 3 (p. 83) in Herz et al. (2006). The two-layer cascade model they discuss resembles the one suggested by Poirazi et al. (2003). See Section 2.1.2.2 for more discussion of dendritic computation in particular.

172.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Erik De Schutter: “Old multi-compartmental models, based on cable theory, described voltage in one dimension, and the typical resolution was on the order of tens of microns per compartment. That is adequate for modeling voltage, but molecular events happen on much smaller scales. Researchers now have much more computing power available to them, and so can build more ambitious models. For instance, they can now use fully stochastic, three-dimensional “mesh” models with sub-micron resolution (typically on the order of 100 nanometers). These can incorporate molecular reactions, as well as features of cell biology like spatial models of synaptic vesicles. fully stochastic, three-dimensional “mesh” models with sub-micron resolution (typically on the order of 100 nanometers). These can incorporate molecular reactions, as well as features of cell biology like spatial models of synaptic vesicles” (p. 1-2).

173.From a review article by Brette (2015): “Do individual spikes matter or can neural computation be essentially described in terms of rates, with spikes physically instantiating this description? This contentious question has generated considerable debate in neuroscience, and is still unsettled” (p. 1). Brette lists a large number of citations relevant to the debate. It’s also possible that something else altogether matters as well (see, e.g., the discussion of other forms of axon signaling in Section 2.3.5).

174.Koch (1999) describes a standard procedure: “In a typical physiological experiment, the same stimulus is presented multiple times to a neuron and its response is recorded (Fig. 14.1). One immediately notices that the detailed response of the cell changes from trial to trial….Given the pulselike nature of spike trains, the standard procedure to quantify the neuronal response is to count how many spikes arrived within some sampling window Δt and to divide this number by the number of presentations” (p. 331). One example of a plausible role of firing rates comes from neurons in the visual cortex, whose firing rates correlate with features of visual images. Classic results in this respect include motion-sensitive neurons in the frog visual system (sometimes characterized as “bug-detectors”) (see Maturana et al. (1960) (p. 148), and Yuste (2015), in the section on “History of the neuron doctrine”) and the orientation-selectivity of neurons in V1 (Hubel and Wiesel (1959), also see video here). Maheswaranathan et al. (2019) also discuss various computations performed in the retina, all of which are expressed in terms of spike rates. Examples include Latency Coding, Motion Reversal, Motion Anticipation, and the Omitted Stimulus Response (see p. 14). See also Surya Ganguli’s description of the results at 4:56 here. Markus Meister, in a 2016 talk (34:04), also discusses a retinal ganglion cell whose firing rate appears to respond to the average of the center of the images in a naturalistic movie (its firing rate remains roughly the same when the entire movie is reduced to this simple summary).

175.See e.g. Hochberg (2012): “Raw neural signals for each channel were sampled at 30 kHz and fed through custom Simulink (Mathworks Inc., Natick, MA) software in 100 ms bins (S3) or 20 ms bins (T2) to extract threshold crossing rates; these threshold crossing rates were used as the neural features for real-time decoding and for filter calibration” (p. 5). See also this discussion at (1:02:00-1:05:00) the Neuralink Launch Event on July 16, 2019.

176.See e.g. Weiss et al. (2018): “many sensory systems use millisecond or even sub-millisecond precise spike timing across sensory neurons to rapidly encode stimulus features (e.g., visual patterns in salamanders [Gollisch and Meister (2008)], direction of sound in barn owls [Carr and Konishi (1990)], and touch location in leeches [Thomson and Kristan (2006)])” (p. 76). Zuo et al. (2015), in a discussion of perceptual decisions in the rat somatosensory cortex: “These results indicate that spike timing makes crucial contributions to tactile perception, complementing and surpassing those made by rate” (abstract). See Funabiki et al. (2011) for very temporally precise in vivo sensitivity in the auditory system of owls, though this could emerge from combining many imprecise inputs: “In owls, NL neurons change their firing rates with changes in ITD of <10 μs (Carr and Konishi (1990); Peña et al. (1996)), far below the spike duration of the neurons (e.g., ∼1 ms).”

177.Brette (2015): “Perhaps the most used argument against spike-based theories is the fact that spike trains in vivo are variable both temporally and over trials (Shadlen and Newsome (1998)), and yet this might well be the least relevant argument. This assertion is what philosophers call a ‘category error’, when things of one kind are presented as if they belonged to another. Specifically, it presents the question as if it were about variability vs. reproducibility. I will explain how variability can arise in spike-based theories, but first an important point to make is that the rate-based view does not explain variability, but rather it simply states that there is variability” (see section on “Assertion #2”). Brette goes on to list a number of objections to appeals to variability as evidence for rate-based theories.

178.One expert suggested this type of thought.

179.See e.g. Izhikevich and Edelman (2007), in the context of a neural network simulation: “We perturbed a single spike (34, 35) in this regime (out of millions) and showed that the network completely reorganized its firing activity within half a second. It is not clear, however, how to interpret this sensitivity in response to perturbations (Fig. 5). On one hand, one could say that this sensitivity indicates that only firing patterns in a statistical sense should be considered, and individual spikes are too volatile. On the other hand, one could say that this result demonstrates that every spike of every neuron counts in shaping the state of the brain, and hence the details of the behavior, at any particular moment. This conclusion would be consistent with the experimental observations that microstimulation of a single tactile afferent is detectable in human subjects (36), and that microstimulation of single neurons in somatosensory cortex of rats affects behavioral responses in detection tasks (37)” (p. 3597).

180.E.g., stochastic processes in the brain can cause a neuron to spike at one time, rather than another, without the brain’s cognitive processing breaking down. See Faisal et al. (2008) for discussion of a number of these processes.

181.See Doose et al. (2016) for one study of in vivo stimulation in rats. Sandberg (2013) argues for a more general point in this vicinity: “Brains sensitive to microscale properties for their functioning would exhibit erratic and non-adaptive behavior” (p. 260). See also Hanson (2011) for comments in a somewhat similar vein. Though note that single impulse stimulation to nerve fibers can result in sensory responses in humans: Vallbo et al. (1984): “It was confirmed that a single impulse in a single FA I unit may elicit a sensory response in the attending subject, whereas a much larger input was required from SA I units, which are also less sensitive to mechanical stimuli. This was one of several findings supporting the impression that differential receptive properties, even within a group of afferents, were associated with different sensory responses. It was concluded that a train of impulses in a single tactile unit may produce within the brain of the subject a construct which specifies with great accuracy the skin area of the unit’s terminals as well as a tactile subquality which is related to unit properties” (abstract).

182.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Chris Eliasmith: “There is no “magical answer” to the question of how accurate a model of neuron spiking needs to be. In experiments fitting neuron models to spike timing data, neuroscientists pick a metric, optimize their model according to that metric, and then evaluate the model according to that metric as well, leaving ongoing uncertainty about the importance of the aspects of neural activity that the relevant metric doesn’t capture” (p. 2).

183.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Eve Marder: “It’s been hard to make progress in understanding neural circuits, because in order to know what details matter, you have to know what the circuit is doing, and in most parts of the brain, we don’t know this…It’s not that you can’t make simplifying assumptions. It’s that absent knowledge of what a piece of nervous system needs to be able to do, you have no way of assessing whether you’ve lost something fundamental or not” (p. 4); and Open Philanthropy’s non-verbatim notes from a conversation with Prof. E.J. Chichilnisky: “It’s hard to know when to stop fine-tuning the details of your model. A given model may be inaccurate to some extent, but we don’t know whether a given inaccuracy matters, or whether a human wouldn’t be able to tell the difference (though focusing on creating usable retinal prostheses can help with this)” (p. 3).

184.Keat et al. (2001): “Is this level of accuracy sufficient? In the real world, the visual system operates exclusively on single trials, without the luxury of improving resolution by averaging many responses to identical stimuli. Nor is there much opportunity to average across equivalent cells, because neurons in the early visual system tend to tile the visual field with little redundancy. Consequently, operation of the visual system under natural conditions does not require the properties of these neurons to be specified more precisely than their trial-to-trial fluctuations. To understand a neuron’s role in visual behavior, we therefore suggest that a model of the light response can be deemed successful if its systematic errors are as small as the neuron’s random errors” (p. 810). See also Open Philanthropy’s non-verbatim notes from a conversation with Dr. Stephen Baccus: “Prof. Baccus expects that there would be consensus in the field that if a model’s correlation with an individual cell’s response to a stimulus matches the correlation between that cell’s responses across different trials with that stimulus, and the model also captures all of the higher-order correlations across different cells, this would suffice to capture everything that the retina is communicating to the brain. Indeed, it would do so almost by definition” (p. 2).

185.Brette (2015): “The lack of reproducibility of neural responses to sensory stimuli does not imply that neurons respond randomly to those stimuli. There are a number of sensible arguments supporting the hypothesis that a large part of this variability reflects changes in the state of the neuron or of its neighbors, changes that are functionally meaningful” (see the section on the “State-Dependence”). See also the discussion in Faisal (2012): “The question whether this neuronal trial-to-trial variability is[:] Indeed just noise (defined in the following as individually unpredictable, random events that corrupt signals) [;] Results because the brain is to [sic] complex to control the conditions across trials (e.g. the organisms may become increasingly hungry or tired across trials) [;] Or rather the reflection of a highly efficient way of coding information [;] cannot easily be answered. In fact, being able to decide whether we are measuring the neuronal activity that is underlying the logical reasoning and not just meaningless noise is a fundamental problem in neuroscience, with striking resemblance to finding the underlying message in cryptographic code breaking efforts (Rieke et al. (1997))” (p. 231).

186.From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Stephen Baccus: “various correlation coefficient measures and information theory measures do not address the importance of the meaning of a given signal. For example, if your model misses a tiger hiding in the bushes, that’s pretty important, even though the difference might account for only a very small fraction of the correlation coefficient between your model and the retina’s response” (p. 2).

187.My thanks to Carl Shulman and Katja Grace for discussion of this analogy.

188.See Naud and Gerstner (2012a) and Herz et al. (2006) for overviews of various models, and Guo et al. (2014) for a review of retinal models in particular.

189.See e.g. Schulz (2010): “the network state in vitro is fundamentally different from the in vivo situation. In acute slices in particular, background synaptic activity is almost absent.”

190.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Shaul Druckmann: “Prof. Druckmann does not think it obvious that the kind of multi-compartmental biophysical models neuroscientists generally use are adequate to capture what a neuron does, as these models, too, involve a huge amount of simplification. Calcium dynamics are the most egregious example. Real neurons clearly do things with calcium, which moves around the cell in a manner that has consequences for e.g. calcium-dependent ion channels. Most biophysical models, however, simplify this a lot, and in general, they treat ions just as concentrations affected by currents.” (p. 4).

191.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Erik De Schutter: “At this point, we have no way to reliably measure the input-output transformation of a neuron, where the input is defined as a specific spatio-temporal pattern of synaptic input. You can build models and test their input-output mappings, but you don’t really know how accurate these models are… In live imaging, it’s very difficult to see what’s happening at synapses. Some people do calcium imaging of pre-synaptic terminals, but this is only for one part of the overall synaptic input (and it may create artefacts). Currently, you cannot get a global picture of all the synaptic inputs to a single neuron. You can’t stain all the inputs, and for a big neuron you wouldn’t be able to image the whole relevant volume of space… you don’t actually know what the physiological pattern of inputs is.” See also Ujfalussy et al. (2018): “Our understanding of neuronal input integration remains limited because it is either based on data from in vitro experiments, studying neurons under highly simplified input conditions, or on in vivo approaches in which synaptic inputs were not observed or controlled, and thus a systematic characterization of the input-output transformation of neurons was not possible”; and Open Philanthropy’s non-verbatim notes from a conversation with Prof. Shaul Druckmann: “It is very difficult to tell what spatio-temporal patterns of inputs are actually arriving at a neuron’s synapses in vivo. You can use imaging techniques, but this is very messy” (p. 2).

192.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Shaul Druckmann: “many dendritic non-linearities contribute more strongly when triggered by synaptic inputs arriving at similar times to similar dendritic locations (“clustering”), and there is evidence that such clustering occurs in vivo. In this sense, a random input regime is unrepresentative, more weakly non-linear than it should be, and therefore may be particularly easy to model.” (p. 3).

193.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Erik De Schutter: “Using glutamate uncaging, you can reliably activate single dendritic spines in vitro, and you can even do this in a sequence of spines, thereby generating patterns of synaptic input. However, even these patterns are limited. For example, you can’t actually activate synapses simultaneously, because your laser beam needs to move; there’s only so much you can do in a certain timeframe; and because it’s glutamate, you can only activate excitatory neurons” (p. 2). From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Shaul Druckmann: “It is very difficult to tell how a neuron responds to arbitrary patterns of synaptic input. You can stimulate a pre-synaptic neuron and observe the response, but you can’t stimulate all pre-synaptic neurons in different combinations. And you can only patch-clamp one dendrite while also patch-clamping the soma (and this already requires world-class skill)” (p. 2).

194.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Erik De Schutter: “There is a tradition of integrate and fire modeling that achieves very accurate fits of neuron firings in response to noisy current injection into the soma (more accurate, indeed, than could be achieved by current biophysical models). However, this is a very specific type of experiment, which doesn’t tell you anything about what happens to synaptic input in the dendrites” (p. 2). From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Shaul Druckmann: “One neuron modeling competition proceeded by assuming that dendritic inputs are randomly distributed, and that dendrites just integrate inputs linearly – assumptions used to create a pattern of current to be injected into the soma of the neurons whose spikes were recorded. If these assumptions are true, then there is good reason to think that fairly simple models are adequate. However, these assumptions are very friendly to the possibility of non-detailed modeling. The point of complex models is to capture the possibly non-linear dendritic dynamics that determine what current goes into the soma: after that point, modeling is much easier. And we don’t know to what extent non-random inputs trigger these dendritic dynamics. There were also a few other aspects of this neuron modeling competition that were not optimal. For example, it was fairly easy to game the function used to evaluate the models” (p. 4).

195.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Markus Meister: “Information in the retina also flows in an almost exclusively feedforward direction (though there are some feedback signals, and it is an interesting question what those fibers do)” (p. 3).

196.See Meister et al. (2013) (p. 577-578). Note also that photoreceptor cells do not spike. Meister et al. (2013): “Photoreceptors do not fire action potentials; like bipolar cells they release neurotransmitter in a graded fashion using a specialized structure, the ribbon synapse” (p. 592).

197.Meister et al. (2013): “The retina is a thin sheet of neurons, a few hundred micrometers thick, composed of five major cell types that are arranged in three cellular layers separated by two synaptic layers” (p. 577). See Meister et al. (2013) (p. 578). The optic nerve also contains glial cells (see Butt et al. (2004)).

198.Note that the light actually has to travel through the ganglion cells in order to get to the photoreceptors.

199.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Markus Meister: “Information in the retina also flows in an almost exclusively feedforward direction (though there are some feedback signals, and it is an interesting question what those fibers do)” (p. 3).

200.See Section 2.1.2.2 for discussion of Beniaguev et al. (2020); and see Section 3.1 for discussion of Maheswaranathan et al. (2019) and Batty et al. (2017).

201.See e.g. London and Häusser (2005): “In this review we argue that this model is oversimplified in view of the properties of real neurons and the computations they perform. Rather, additional linear and nonlinear mechanisms in the dendritic tree are likely to serve as computational building blocks, which combined together play a key role in the overall computation performed by the neuron” (p. 504).

202.Stuart and Spruston (2015): “Rall and others found that the passive membrane properties of dendrites, that is, their resistance and capacitance as well as their geometry, influence the way neurons integrate synaptic inputs in complex ways, enabling a wide range of nonlinear operations” (p. 1713). For example: if you inject a high-frequency current into a dendrite, the local voltage response in that dendrite will be higher frequency and larger amplitude than the response recorded in the soma (see London and Häusser (2005) (p. 508)); when multiple inputs arrive in a similar dendritic location at the same time, the impact on the membrane potential of the first can reduce the size of the impact on the membrane potential of the other (see London and Häusser (2005) (p. 507)); and when excitatory and inhibitory inputs arrive at a similar location in the dendrite, the inhibitory input can “shunt” the excitatory input, reducing its impact on somatic membrane potential in a manner distinct from a linear sum, and perhaps even cancelling the excitatory signal entirely (see London and Häusser (2005) (p. 509)).

203.See London and Häusser (2005) (p. 509-516), and Stuart and Spruston (2015) (p. 1713-1714). If a back-propagating action potential occurs at the same time as a certain type of input to the dendrite, this can trigger a burst of somatic action potentials (see London and Häusser (2005) (p. 509)). A new class of calcium-mediated dendritic action-potentials (dCaAPs) was recently discovered in humans, and shown to make possible a type of input-output relation previously thought to require a network of neurons. Gidon et al. (2020): “we investigated the dendrites of layer 2 and 3 (L2/3) pyramidal neurons of the human cerebral cortex ex vivo. In these neurons, we discovered a class of calcium-mediated dendritic action potentials (dCaAPs) whose waveform and effects on neuronal output have not been previously described…. These dCaAPs enabled the dendrites of individual human neocortical pyramidal neurons to classify linearly non-separable inputs—a computation conventionally thought to require multilayered networks” (from the abstract).

204.See Reyes (2001), London and Häusser (2005), Stuart and Spruston (2015), Payeur et al. (2019), and Poirazi and Papoutsi (2020) for reviews.

205.See discussion of synaptic clustering on p. 310 of Poirazi and Papoutsi (2020), though they also suggest that “The above predictions suggest that dendritic — and, consequently, somatic — spiking is not necessarily facilitated by synaptic clustering, as was previously assumed” (p. 310).

206.Moore et al. (2017): “The dendritic spike rates, however, were fivefold greater than the somatic spike rates of pyramidal neurons during slow-wave sleep and 10-fold greater during exploration. The high stability of dendritic signals suggested that these large rates are unlikely to arise due to the injury caused by the electrodes” (p. 1 of “Research Article Summary”).

207.Moore et al. (2017): “the total energy consumption in neural tissue … could be dominated by the dendritic spikes” (p. 8). The Science summary here also notes that dendrites occupy more than 90% of neuronal tissue.

208.See London and Häusser (2005) (p. 516-524), and Payeur et al. (2019) for examples. See also Schmidt-Hieber et al. (2017): “Our results suggest that active dendrites may therefore constitute a key cellular mechanism for ensuring reliable spatial navigation” (abstract).

209.Stephen Baccus recalled estimates from Bartlett Mel to the effect that something in the range of five dendritic sub-units would be sufficient (see Open Philanthropy’s non-verbatim notes from a conversation with Dr. Stephen Baccus, p. 3). Markus Meister also suggested that models of cortical pyramidal cells that include two point neurons – one for the dynamics at the soma, and the other for the dynamics in the apical tuft – can account for a lot of what’s going on (see Open Philanthropy’s non-verbatim notes from a conversation with Prof. Markus Meister, p. 4). From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Anthony Zador: “Much of Prof. Zador’s PhD work was devoted to the hypothesis that dendritic computation is the key difference between artificial neural networks and real brains. However, at the end of the day, he was led to the conclusion that dendritic computation does not make a qualitative difference to the computational capacity of a neuron. There is some computational boost, but the same effect could be achieved by replacing each biological neuron with a handful of artificial neurons” (p. 3). See also Naud et al. (2014): “We conclude that a simple two-compartment model can predict spike times of pyramidal cells stimulated in the soma and dendrites simultaneously. Our results support that regenerating activity in the apical dendritic is required to properly account for the dynamics of layer 5 pyramidal cells under in-vivo-like conditions” (abstract). See also Ujfalussy et al. (2018), though I’m not sure exactly how complex their model was: “We used the hLN to predict the somatic membrane potential of an in vivo-validated detailed biophysical model of a L2/3 pyramidal cell. Linear input integration with a single global dendritic nonlinearity achieved above 90% prediction accuracy.” (abstract).

210.See Li et al. (2019): “We derive an effective point neuron model, which incorporates an additional synaptic integration current arising from the nonlinear interaction between synaptic currents across spatial dendrites. Our model captures the somatic voltage response of a neuron with complex dendrites and is capable of performing rich dendritic computations” (p. 15246).

211.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Chris Eliasmith: “There are also arguments that certain forms of active dendritic computation function to “linearize” the inputs – e.g., to combat the attenuation of an input signal as it travels through the dendritic tree, such that the overall result looks more like direct injection into the soma” (p. 3-4).

212.For example, various results explore the computational role of active computation in the apical dendrite of cortical pyramidal cells (see London and Häusser (2005) for examples). For results related to dendritic computation that does happen in the retina, see Taylor et al. (2000) and Hanson et al. (2019).

213.I’m not sure exactly what grounds this suggestion, but it is consistent with a number of abstract models of dendritic computation. See Poirazi et al. (2003); Tzilivaki et al. (2019); Jadi et al. (2014); and Ujfalussy et al. (2018). All of these use sigmoidal non-linearities in dendritic subunits. See e.g. Ujfalussy et al. (2018): “We chose a sigmoid nonlinearity for several reasons. First, the sigmoid has been proposed elsewhere as an appropriate dendritic nonlinearity (Poirazi et al., 2003a, Polsky et al., 2004). Second, under different parameter settings and input statistics, the sigmoid is sufficiently flexible to capture purely linear, sublinear, and supralinear behavior, as well as combinations thereof.”

214.It is possible to formulate and prove this sort of limitation using graph theory. However, the proof is quite long, and I won’t include it here.

215.Some assumption is required here to the effect that the non-linearities themselves can’t be that expensive, and/or performed many times in a row. I haven’t explored this much, but I could imagine questions about the interchangeability of nonlinearities in artificial neural networks being relevant (see discussion in next section). Poirazi et al. (2003), Tzilivaki et al. (2019), Jadi et al. (2014), and Ujfalussy et al. (2018) all use sigmoidal non-linearities, a standard version of which (y = 1 / (1 + exp(-x))) appears to be ~4 FLOPs (see “Activation Functions” here).

216.See Open Philanthropy’s non-verbatim notes from a conversation with Dr. Adam Marblestone (p. 5):

As Dr. Marblestone understands this argument, the idea is that while there may well be dendritic non-linearities, you should expect a tree-like structure of local interactions, and activity in one part of the tree can’t exert fast, long-range influence on activity in another part. This rules out scenarios where, for example, any synapse can communicate with any other – a scenario in which required compute could scale with the square of the number of synapses. This argument is consistent with Dr. Marblestone’s perspective, and he thinks it is very interesting, though it would be nice to formalize it more precisely.

From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Barak Pearlmutter (p. 2):

Prof. Pearlmutter was sympathetic to the idea that the tree-structure of dendrites would limit the compute burdens that dendritic computation could introduce. There is an important distinction between causal models that are tree-structured and ones that are not tree-structured. Non-tree-structured causal models can have cycles that quickly become very computationally expensive, whereas tree-structured models are comparatively easy to compute. He suggested that this type of consideration applies to dendrites as well (including in the context of feedbacks between the dendrites and the soma). Prof. Pearlmutter thought it a fairly good intuition that dendritic computation would only implicate a small constant factor increase in required compute, though very complicated local interactions could introduce uncertainty.

From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Chris Eliasmith (p. 3):

Prof. Eliasmith believes that neurons probably have non-linearities in their dendrites. In attempting to construct models of attention, for example, he has found that he needs more model neurons than seem biologically realistic, and the neuron count would go way down if he had certain kinds of non-linearities in the dendrites. Including these non-linearities would not drastically increase compute burdens (it might be equivalent to a 2× increase). A simple version would basically involve treating a single neuron as a two-layer neural network, in which dendrites collect inputs and then perform a non-linearity before passing the output to the soma. Prof. Eliasmith is sympathetic to the idea that the tree-structure of dendrites limits the additional complexity that dendritic computation could implicate in the context of such multi-layer networks (e.g., the tree-structure limits the outgoing connections of a dendritic sub-unit, and additional non-linearities in the neuron do not themselves add much compute in a regime where spikes through synapses are already the dominant compute burden). That said, there are many mechanisms in neurons that could in principle make everything more complicated.

217.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Shaul Druckmann: “Prof. Druckmann does not think that appeals to the manageable compute burdens of modeling dendrites as comparatively small multi-layer neural networks (for example, with each dendritic sub-unit performing its own non-linearity on a subset of synaptic inputs) definitively address the possibility that modeling dendritic non-linearities requires very large amounts of compute. Small multi-layer network models are really just a guess about what’s required to capture the neuron’s response to realistic inputs. For example, in a recent unpublished paper, David Beniaguev, Idan Segev, and Michael London found that adding NMDA currents to the detailed model increased the size of the neural network required to replicate its outputs to seven layers (the long time-constant of NMDA receptors increases the complexity of the neuron’s input-output transformation). Adding in other neuron features could require many more layers than this. 10 layers might be manageable, but 500 is a pain, and the true number is not known” (p. 3).

218.This type of illustration was also suggested by Dr. Amodei.

219.See Poirazi et al. (2003); Tzilivaki et al. (2019); Jadi et al. (2014); and Ujfalussy et al. (2018).

220.From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Paul Christiano: “A ReLU costs less than a FLOP. Indeed, it can be performed with many fewer transistors than a multiply of equivalent precision” (p. 6). See here for some discussion of the FLOPs costs of a tanh, and here for discussion of exponentials. A standard sigmoid activation (y = 1 / (1 + exp(-x))) appears to be ~4 FLOPs (see “Activation Functions” here). Poirazi et al. (2003) use various sigmoids in this vein, see Figure 5.
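
To make the ~4 FLOPs figure concrete, here is a minimal sketch in Python, counting each elementary operation (negation, exponential, addition, division) as one FLOP – a rough counting convention, not an exact hardware cost:

```python
import math

def sigmoid(x):
    """Standard logistic sigmoid, y = 1 / (1 + exp(-x)).

    Counting each elementary operation as one FLOP -- the negation,
    the exponential, the addition, and the division -- gives the
    ~4 FLOPs per activation cited above. (Treating exp as a single
    FLOP is a rough convention; on real hardware it may cost more.)
    """
    return 1.0 / (1.0 + math.exp(-x))  # 4 ops: -, exp, +, /
```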

221.This factor is centrally determined by the ratio of FLOPs per non-linearity to FLOPs per input, which is 10x in the example above; this is on the high end for non-linearities in ANNs.

222.Thus, for example, assuming 1000 inputs and a 1 Hz average firing rate, on average there will be one spike through synapse per 1 ms timestep. If we budget 1 FLOP per spike through synapse, but assume 100 dendritic sub-units, each performing a non-linearity on 10 synaptic input connections, and we assume that everything but spikes through synapses must be computed every time-step, we get the following budget per 1 ms timestep (reproduced in the code sketch after the list):

Point neuron model (assuming sparse FLOP/s for synaptic transmission):
Soma: 1 FLOP (average number of input spikes per ms) + 10 FLOPs (non-linearity)
Total: 11 FLOPs
Sub-unit model:
Dendrites: 100 (sub-units) × (0.01 FLOPs (average number of spikes through synapses per 10 synapses per ms) + 10 FLOPs (non-linearity))
Soma: 100 FLOPs (additions from sub-unit outputs) + 10 FLOPs (non-linearity)
Total: ~1110 FLOPs
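
For concreteness, the arithmetic above can be written out as a short Python sketch. All quantities are the illustrative assumptions of this footnote (1000 inputs, 1 Hz average firing, 1 FLOP per spike through synapse, 10 FLOPs per non-linearity, 100 sub-units), not empirical estimates:

```python
# Illustrative FLOP budgets per 1 ms timestep, using the assumptions above.
N_INPUTS = 1000          # synaptic inputs
AVG_RATE_HZ = 1          # average input firing rate
FLOPS_PER_SPIKE = 1      # FLOPs per spike through synapse (computed sparsely)
FLOPS_PER_NONLIN = 10    # FLOPs per non-linearity (computed every timestep)
N_SUBUNITS = 100         # dendritic sub-units, 10 synapses each

# Average input spikes arriving per 1 ms timestep.
spikes_per_ms = N_INPUTS * AVG_RATE_HZ / 1000  # = 1

# Point neuron: sum the (sparse) input spikes, then one somatic non-linearity.
point_neuron = spikes_per_ms * FLOPS_PER_SPIKE + FLOPS_PER_NONLIN  # = 11

# Sub-unit model: each sub-unit handles its share of the input spikes and
# applies its own non-linearity every timestep; the soma then adds the 100
# sub-unit outputs and applies a final non-linearity.
dendrites = N_SUBUNITS * (spikes_per_ms / N_SUBUNITS * FLOPS_PER_SPIKE
                          + FLOPS_PER_NONLIN)   # = 1001
soma = N_SUBUNITS + FLOPS_PER_NONLIN            # = 110
subunit_model = dendrites + soma                # = 1111, i.e. ~1110

print(f"point neuron: {point_neuron:.0f} FLOPs; "
      f"sub-unit model: {subunit_model:.0f} FLOPs")
```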

223.Beniaguev et al. (2020): “A thorough search of configurations of deep and wide fully-connected neural network architectures (FCNs) have failed to provide a good fit to the I/O characteristics of the L5PC model. These failures suggest a substantial increase in the complexity of I/O transformation compared to that of I&F. Indeed, only temporally convolutional network architecture (TCN) with 7 layers and 128 channels per layer, provided a good fit (Fig. 2B, C Fig. S5)” (p. 7).

224.Beniaguev et al. (2020): “We hypothesized that removing NMDA dependent synaptic currents from our L5PC model will significantly decrease the size of the respective DNN… after removing the NMDA voltage dependent conductance, such that the excitatory input relies only on AMPA mediated conductances, we have managed to achieve a similar quality fit as in Fig. 2 when using a much smaller network – a fully connected DNN (FCN) with 128 hidden units and only a single hidden layer (Fig. 3B). This significant reduction in complexity is due to the ablation of NMDA channels” (p. 8-10).

225.Here’s my estimate, which the lead author tells me looks about right. 1st layer: 1278 synaptic inputs × 35 × 128 = 5.7 million MACCs (from line 140 and lines 179-180 here); Next 6 layers: 6 layers × 128 × 35 × 128 = 3.4 million MACCs. Total per ms: ~ 10 million MACCs. Total per second: ~10 billion MACCs. Multiplied by 2 to count individual FLOPs (see “It’s dot products all the way down” here) = ~20 billion FLOP/s per cell. Though the authors also note that “the accuracy of the model was insensitive to the temporal kernel sizes of the different DNN layers when keeping the total temporal extent of the entire network fixed, so the temporal extent of the first layer was selected to be larger than subsequent layers mainly for visualization purposes” (p. 7). I’m not sure what kind of difference this might make. Note also that this is still less than the biophysical model itself, which they say ran several orders of magnitude slower: “Note that, despite its seemingly large size, the resulting TCN represents a substantial decrease in computational resources relative to a full simulation of a detailed biophysical model (involving numerical integration of thousands of nonlinear differential equations), as indicated by a speedup of simulation time by several orders of magnitude” (p. 8).
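
A short sketch reproducing the arithmetic of this estimate (layer sizes are those reported above for the Beniaguev et al. (2020) TCN; one MACC is counted as 2 FLOPs, per the convention linked above):

```python
# Reproducing the per-cell cost estimate for the 7-layer TCN above.
SYNAPTIC_INPUTS = 1278   # input channels to the first layer
KERNEL = 35              # temporal kernel size of the first layer
CHANNELS = 128           # channels per layer
LATER_LAYERS = 6         # layers after the first (7 total)

first_layer = SYNAPTIC_INPUTS * KERNEL * CHANNELS           # ~5.7e6 MACCs
later_layers = LATER_LAYERS * CHANNELS * KERNEL * CHANNELS  # ~3.4e6 MACCs

maccs_per_ms = first_layer + later_layers  # ~9.2e6, i.e. ~10 million MACCs
maccs_per_s = maccs_per_ms * 1000          # ~10 billion MACCs per second
flops_per_s = 2 * maccs_per_s              # ~20 billion FLOP/s per cell

print(f"{maccs_per_ms / 1e6:.1f}M MACCs/ms, "
      f"~{flops_per_s / 1e9:.0f}B FLOP/s per cell")
```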

226.Beniaguev et al. (2020) (p. 15):

It is important to emphasize that, due to optimization, the complexity measure described above is an upper bound of the true computational complexity of the I/O of a single neuron, i.e., it is possible that there exists a much smaller neural network that could mimic the biophysical neuron with a similar degree of accuracy but the training process we used could not find it. Additionally, we note that we have limited our architecture search space only to fully connected (FCN) and temporally convolutional (TCN) neural network architectures. It is likely that additional architectural search could yield even simpler and more compact models for any desired degree of prediction accuracy. In order to facilitate this search in the [sic] scientific community, we hereby release our large readymade [sic] dataset of simulated inputs and outputs of a fully complex single layer 5 cortical neuron in an invivo [sic] like regime so that the community can focus on modelling various aspects of this endeavour and avoid running the simulations themselves.

227.Beniaguev et al. (2020): “now that we estimate that a cortical L5 pyramidal neuron is equivalent to a deep network with 7 hidden layers, this DNN could be used to teach the respective neuron to implement a function which is in the scope of the capabilities of such a network, such as classifying hand written digits or a sequence of auditory sounds. One can then both validate the hypothesis that single neurons could perform complex computational tasks and investigate how these neurons can implement such complex tasks” (p. 16).

228.Though see Open Philanthropy’s non-verbatim notes from a conversation with Dr. Paul Christiano: “Dr. Christiano is very skeptical of the hypothesis that a single, biological cortical neuron could be used to classify handwritten digits” (p. 6).

229.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Eve Marder: “You can see maintaining these rhythms as the high-level function that the circuit is performing at a given time (transitions between modes of operation are discussed below). Neuroscientists had a wiring diagram for the pyloric rhythm in 1980, and there was a fairly good first-principles idea of how it worked back then. It is not too difficult to model tri-phasic rhythm” (p. 1).

230.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Eve Marder: “Prof. Marder and her collaborators have used single-compartment conductance models to replicate the rhythms in the stomatogastric ganglion” (p. 4). And from Open Philanthropy’s non-verbatim notes from a conversation with Dr. Adam Marblestone: “These neurons create oscillations that can be very well modeled and understood using Hodgkin-Huxley type neuron models” (p. 4).

231.E.g., if what matters about these rhythms is just that units activate in a certain regular, rhythmic sequence (I’m not sure about the details here, and the full range of dynamics that matter could be much more complicated), it seems possible to create this sort of sequence in a very non-brain-like way. That said, achieving the brain’s level of robustness and flexibility in maintaining these rhythms across different circumstances is a different story.

232.Prinz et al. (2004): “To determine how tightly neuronal properties and synaptic strengths need to be tuned to produce a given network output, we simulated more than 20 million versions of a three-cell model of the pyloric network of the crustacean stomatogastric ganglion using different combinations of synapse strengths and neuron properties. We found that virtually indistinguishable network activity can arise from widely disparate sets of underlying mechanisms, suggesting that there could be considerable animal-to-animal variability in many of the parameters that control network activity, and that many different combinations of synaptic strengths and intrinsic membrane properties can be consistent with appropriate network performance” (p. 1345). See also Marder and Goaillard (2006) for review of other related findings, for example Figure 2, “Neurons with similar intrinsic properties have different ratios of conductances” (p. 566), Figure 4, “Similar network behavior with different underlying conductances” (p. 569), and Figure 6, “Constancy of network performance despite major size changes during growth” (p. 571). See also Open Philanthropy’s non-verbatim notes from a conversation with Dr. Adam Marblestone: “There are important molecular mechanisms at work, but these function to make the circuit robust. For example, across crabs, gene expression levels in equivalent stomatogastric neurons vary a lot, but they are correlated within a given crab, suggesting that there are many different gene expression solutions that can create the same functioning network, and that the cell’s mechanisms are set up to make sure the neurons find such a solution. This system has many different possible states, which can be induced by different neuromodulators. But in any given one of those states, the real-time, fast computation is fairly understandable. Perhaps the whole brain is like that” (p. 4).

233.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Eve Marder: “Biology has found a series of mechanisms that allow the system to transition smoothly between different modes of operation. For example, you can walk slowly or quickly. Although eventually you will change gait. Prof. Marder believes that such smooth transitions are centrally important to understanding brains, especially big brains. The mechanisms involved allow brains to avoid having to fine-tune or find singular solutions. However, most computational models don’t capture these transitions. For example, if you want to capture the behavior of an eight channel neuron with a three channel model, you’ll hit nasty bifurcations. Indeed, one hypothesis is that neurons have many ion channels with overlapping functions because this facilitates smooth transitions between states” (p. 2).

234.Locusts jump out of the way when you show them a “looming stimulus” – that is, a visual stimulus that grows in size in a manner that mimics an object on a collision course with the locust (see videos here and slower-motion here). In a particular locust neuron known as the lobula giant movement detector (LGMD), the firing rate increases, peaks, and decreases as collision with the object appears to become imminent, and the peak firing rate occurs with a fixed delay after the object reaches a particular threshold angular size on the retina (see Fotowat and Gabbiani (2011) (p. 4)). Gabbiani et al. (2002) hypothesize that this angular size “might be the image-based retinal variable used to trigger escape responses in the face of an impending collision. Indeed, a leg flexion (presumably in preparation for an escape jump) has been shown to follow the peak LGMD firing rate with a fixed delay” (p. 320). The LGMD also synapses onto a further neuron – the descending contralateral movement detector (DCMD) – that connects to motor neurons responsible for jumping, and which itself fires every time the LGMD fires. The timing of take-off can be very well predicted from the peak firing rate of the DCMD (see Fotowat and Gabbiani (2011) (p. 12)). What’s more, examination of the physiology of the neuron supports a particular hypothesis about how its biological hardware implements this function. The dendritic tree of the LGMD can be divided into two portions – an excitatory portion and an inhibitory portion. The excitatory portion receives input from the visual system roughly proportionate to the angular velocity (that is, the rate of change of the angular size) of the stimulus raised to the power of two to three, and then outputs positive current roughly proportionate to the logarithm of angular velocity. The inhibitory portion, by contrast, receives input roughly proportionate to the square of the angular size of the stimulus, and outputs negative current in an approximately linear relationship to the angular size of the stimulus (the relationship is actually best described by a sigmoid, but it is treated as linear in the overall model). These positive and negative currents then combine at the spike initiation zone in a manner that results in an overall membrane potential that reflects the sum of the positive and negative currents. The average spiking rate of the neuron is then proportionate to the membrane potential raised to the power three, which is roughly equivalent to an exponential at the relevant scales (see Jones and Gabbiani (2012), Figure 8, for a description of this hypothesis, together with Christof Koch’s discussion here).
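
To illustrate the computation this hypothesis describes (a sketch of the relationship in the note, not the authors’ code): because exp(log(θ′) − αθ) = θ′ · e^(−αθ), logarithmic excitation plus linear inhibition, followed by a roughly exponential output non-linearity, implements a multiplication. The parameter values below (alpha, k) are illustrative placeholders:

```python
import math

def lgmd_rate(theta, dtheta, alpha=1.0, k=1.0):
    """Sketch of the hypothesized LGMD computation described above.

    Excitatory current ~ log of angular velocity (dtheta, rad/s, > 0);
    inhibitory current ~ angular size (theta, rad); the two combine at
    the spike initiation zone, and the output non-linearity (a third
    power, roughly exponential at the relevant scales) is written here
    in its exponential form. Since exp(log(x) - a*y) = x * exp(-a*y),
    the result is the multiplicative form dtheta * exp(-alpha * theta).
    alpha and k are illustrative placeholders, not fitted values.
    """
    excitation = math.log(dtheta)
    inhibition = alpha * theta
    return k * math.exp(excitation - inhibition)

# For an object of half-size l approaching at constant speed v, the angular
# size is theta(t) = 2 * atan(l / (v * |t|)) for t < 0 (collision at t = 0);
# the rate above rises, peaks, and falls as collision becomes imminent, with
# the peak tracking a threshold angular size, as the note describes.
```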

235.See Fig. 1 in Jadi et al. (2014) for some other examples of circuit models using point neuron models. They cite Raymond et al. (1996) for cerebellar circuit models; Raphael et al. (2010) for a model of the spinal cord; and Crick (1984) for a model of attention. Grid cells might be another example, as might the Jeffress model of auditory coincidence detection. See also Open Philanthropy’s non-verbatim notes from a conversation with Prof. Eve Marder: “There are also some circuits in leeches, C. elegans, flies, and electric fish that are relatively well-characterized” (p. 4).

236.From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Stephen Larson: “There may be selection bias at work in appeals to the success of simple models in some contexts as evidence for their adequacy in general. With respect to phenomena that simple models have thus far failed to explain, such explanation might not be possible” (p. 4).

237.I’m partly influenced here by discussions with Dr. Adam Marblestone; see Open Philanthropy’s non-verbatim notes from a conversation with Dr. Adam Marblestone: “Dr. Marblestone does not think that selection effects nullify the evidence provided by our understanding of peripheral sensory and motor systems. E.g., it’s not that we did experiments on a bunch of systems, and some of them we couldn’t figure out, and some of them we could. Rather, the distribution of neuroscientific success has more to do with our experimental access to peripheral sensory/motor systems, together with differences in the types of theories you would need to have in order to explain more architecturally-complex circuits deeper in the brain. Similarly, Dr. Marblestone does not think that the fact that we can’t simulate C. elegans is a good argument for any kind of special computation taking place within C. elegans neurons. Lots of other explanations are available: notably, that it’s very difficult to figure out the right parameters” (p. 8). See also the section of Open Philanthropy’s non-verbatim notes from a conversation with Prof. Markus Meister entitled “Scientific advantages of peripheral systems” (p. 2-3), as well as Open Philanthropy’s non-verbatim notes from a conversation with Prof. Eve Marder (p. 4), section title: “The epistemic barriers to understanding circuits.”

238.From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Stephen Larson, who works on the OpenWorm project: “Despite its small size, we do not yet have a model that captures even 50% of the biological behavior of the C. elegans nervous system. This is partly because we’re just getting to the point of being able to measure what the worm’s nervous system is doing well enough” (p. 1). David Dalrymple, who used to work on emulating C. elegans, writes: “What you actually need is to functionally characterize the system’s dynamics by performing thousands of perturbations to individual neurons and recording the results on the network, in a fast feedback loop with a very very good statistical modeling framework which decides what perturbation to try next.” Sarma et al. (2018), in an overview of OpenWorm’s progress, write: “The level of detail that we have incorporated to date is inadequate for biological research. A key remaining component is to complete the curation and parameter extraction of Hodgkin–Huxley models for ion channels to produce realistic dynamics in neurons and muscles” (Section 3).

239.From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Adam Marblestone: “Some neural circuits, like ones in the spinal cord, are very simple. And one can imagine primitive synapses, involved in primitive computations like “if you get some dopamine, move this part of the jellyfish like so.” Genetic programs build these machines on the basis of relatively simple specifications, and you have to be able to reliably repurpose these machines without every molecule mattering. Dr. Marblestone expects that evolution proceeded by reusing and recombining these relatively simple, reliable components” (p. 4-5). See also Open Philanthropy’s non-verbatim notes from a conversation with Prof. Jared Kaplan: “It is theoretically possible that there is a large amount of additional computation taking place within neurons, but this seems very implausible, and Prof. Kaplan finds it difficult to evaluate arguments that condition on this possibility. One reason this seems implausible is that neurons aren’t that different across species, and it does not seem plausible to Prof. Kaplan that in simple species with very few neurons, large amounts of computation are taking place inside the neurons. One would need a story about when this complex internal computation developed in the evolutionary history of neurons” (p. 2-3).

240.Though see also comments from Open Philanthropy’s non-verbatim notes from a conversation with Prof. Erik De Schutter: “The brain was not engineered. Rather, it evolved, and evolution works by adding complexity, rather than by simplification. There are good reasons for this complexity. In order to evolve, you can’t have systems, at any level (proteins, channels, cells, brain regions), with unique functions. If you did, and a single mutation knocked out the function, the whole system would crash. Whereas if you have overlapping functions, performance suffers somewhat, but something else can take over. If you don’t allow for this, you can’t evolve, since evolution works by random mutations, and most mutations are not positive” (p. 4).

241.Dr. Dario Amodei suggests considerations in this vein, though I’m not sure I’ve understood what he has in mind. See also Open Philanthropy’s non-verbatim notes from a conversation with Prof. Jared Kaplan: “most of his probability mass on the hypothesis that most of the computation performed by the brain is visible as information transferred between synapses… It is theoretically possible that there is a large amount of additional computation taking place within neurons, but this seems very implausible” (p. 2); and my discussions of the communication method with Dr. Paul Christiano, see Open Philanthropy’s non-verbatim notes from a conversation with Dr. Paul Christiano. That said, Amodei, Christiano, and Kaplan all work at the same organization (OpenAI), so their beliefs and arguments may be correlated due to internal discussion.

242.From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Paul Christiano: “Neurons receive only a limited number of bits in, and they output only a limited number of bits. However, in principle, you can imagine computational elements receiving encodings of computationally intensive problems via their synaptic inputs (e.g., “is this boolean formula satisfiable?”), and then outputting one of a comparatively small set of difficult-to-arrive-at answers.” (p. 6).

243.Here I’m using a rough estimation method suggested by Dr. Paul Christiano, from Open Philanthropy’s non-verbatim notes from a conversation with Dr. Paul Christiano: “You can roughly estimate the bandwidth of axon communication by dividing the firing rate by the temporal resolution of spiking. Thus, for example, if the temporal precision is 1 ms, and neurons are spiking at roughly 1 Hz, then each spike would communicate ~10 bits of information (e.g., log2(1000)). If you increase the temporal precision to every microsecond, that’s only a factor of two difference (e.g., log2(1,000,000) = ~20 bits)” (p. 2). There is a large literature on the information carried by action potentials that I’m not engaging with. See Dayan and Abbott (2001), Chapter 4 (p. 123-150); Zador (1998); Tsubo et al. (2012), Fuhrmann et al. (2001), Mainen and Sejnowski (1995), and van Steveninck et al. (1997).
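
A worked version of this estimate, using only the numbers in the quote:

```python
import math

def bits_per_spike(rate_hz, resolution_s):
    """Rough per-spike information: with firing rate r and temporal
    resolution dt, there are ~1 / (r * dt) distinguishable time slots
    per typical inter-spike interval, and specifying which slot a spike
    lands in carries log2(1 / (r * dt)) bits."""
    return math.log2(1 / (rate_hz * resolution_s))

print(bits_per_spike(1, 1e-3))  # 1 Hz at 1 ms precision: ~10 bits
print(bits_per_spike(1, 1e-6))  # 1 Hz at 1 microsecond precision: ~20 bits
```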

244.See here, and more discussion of the difficulties here.

245.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Markus Meister: “Prof. Meister thinks that people often overestimate the sophistication of the tasks that humans perform, which tend to involve low-bandwidth outputs. People have measured the bits per second involved in different types of motor outputs (e.g., typing, playing piano, athletics, speaking speed, etc.), and the numbers are in the range of 10-40 bits per second. Similarly, people have tried to measure the information rate of human thought (for example, by seeing how much information humans can retain per second in reading), and it’s in the same ballpark” (p. 5).

246.Izhikevich (2004): “The most common type of excitatory neuron in mammalian neocortex, namely the regular spiking (RS) cell, fires tonic spikes with decreasing frequency, as in Fig. 1(f). That is, the frequency is relatively high at the onset of stimulation, and then it adapts. Low-threshold spiking (LTS) inhibitory neurons also have this property. The interspike frequency of such cells may encode the time elapsed since the onset of the input” (p. 1064); “Most cortical neurons fire spikes with a delay that depends on the strength of the input signal. For a relatively weak but superthreshold input, the delay, also called spike latency, can be quite large, as in Fig. 1(i). The RS cells in mammalian cortex can have latencies of tens of ms. Such latencies provide a spike-timing mechanism to encode the strength of the input” (p. 1065).

247.Izhikevich (2004): “The most efficient is the I&F model. However, the model cannot exhibit even the most fundamental properties of cortical spiking neurons, and for this reason it should be avoided by all means. The only advantage of the I&F model is that it is linear, and hence amenable to mathematical analysis. If no attempts to derive analytical results are made, then there is no excuse for using this model in simulations” (p. 1069). See also Jolivet et al. (2008b): “What follows from the results of challenge A displayed in Tables 1 and 2 is that standard leaky integrate-and-fire models or other off-the-shelf methods are not sufficient to account for the variety of firing patterns and firing rates generated by a single neuron. The conclusion is that one has to include some dynamics in the threshold so as to achieve two things: first, to account in some rough fashion for neuronal refractoriness, and, second, to gain some flexibility in matching the mean firing rates across different stimulation paradigms. We had already shown that predicting subthreshold membrane voltage is relatively easy (Jolivet et al. (2006a)). Predicting the exact timing of spikes is where the difficulty resides” (p. 425).

248.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Dong Song: “The functional impact of ion channel dynamics in the context of a Hodgkin-Huxley model is highly redundant. This makes Prof. Song think that Hodgkin-Huxley models can be simplified – e.g. you can replicate the input-output behavior of the Hodgkin-Huxley model, with fewer equations. Indeed, this almost has to be the case. There are also studies that show that many different combinations of ionic channels can generate the same overall behavior, both for a single neuron and a small neuronal circuit” (p. 2).

249.He cites Hoppensteadt and Izhikevich (2001), in which he goes into more detail: “Briefly, a model is canonical for a family if there is a continuous change of variables that transforms any other model from the family into this one, as we illustrate in Figure 1. For example, the entire family of weakly coupled oscillators of the form (1) can be converted into the canonical phase model (6), where H_ij depend on the particulars of the functions f_i and g_ij. The change of variables does not have to [be] invertible, so the canonical model is usually lower-dimensional, simple, and tractable. Yet, it retains many important features of the family. For example, if the canonical model has multiple attractors, then each member of the family has multiple attractors…” (p. 1).

250.Here is a summary of recent AI progress from Hassabis et al. (2017): “In AI, the pace of recent research has been remarkable. Artificial systems now match human performance in challenging object recognition tasks (Krizhevsky et al. (2012)) and outperform expert humans in dynamic, adversarial environments such as Atari video games (Mnih et al. (2015)), the ancient board game of Go (Silver et al. (2016)), and imperfect information games such as heads-up poker (Moravčík et al. (2017)). Machines can autonomously generate synthetic natural images and simulations of human speech that are almost indistinguishable from their real-world counterparts (Lake et al. (2015), van den Oord et al. (2016)), translate between multiple languages (Wu et al. (2016)), and create “neural art” in the style of well-known painters (Gatys et al. (2015))” (p. 250). See also LeCun et al. (2015) for a review of deep learning progress. Other recent advances include OpenAI et al. (2019), Vinyals et al. (2019), Radford et al. (2019), Brown et al. (2020).

251.See Kriegeskorte (2015) and Nielsen’s “Neural Networks and Deep Learning” for general introductions.

252.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Eric Jonas: “Prof. Jonas does not think that there is a clear meaning to the claim that the brain is a deep learning system, and he is unconvinced by the argument that ‘the brain is doing optimization, and what is deep learning but optimization?’. He also has a long-term prior that researchers are too quick to believe that the brain is doing whatever is currently popular in machine learning, and he doesn’t think we’ve found the right paradigm yet” (p. 3).

253.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Anthony Zador: “In the early days of neural networks, people thought you needed sigmoid activation functions, and that piecewise linear models could not work because they are not differentiable. But it turns out that computers can handle the function having one non-differentiable point, so the two are largely interchangeable, and it’s fine to go with the more convenient option. The main constraint is that the function needs to be monotonically increasing. This is an example of a case in which the precise function generating a neuron’s output does not matter” (p. 2). See also Kriegeskorte (2015): “The particular shape of the nonlinear activation function does not matter to the class of input–output mappings that can be represented” (p. 422); and Tegmark (2017): “It’s been proven that almost any function will suffice as long as it’s not linear (a straight line)” (p. 72, endnote 5).
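
To make the interchangeability claim concrete, here is a minimal sketch of the two kinds of activation function at issue (a smooth sigmoid and a piecewise-linear ReLU); the code and names are illustrative, not from any of the cited sources:

```python
import numpy as np

# Both functions are monotonically increasing, which is the main constraint
# the endnote identifies; networks built on either can represent essentially
# the same class of input-output mappings.
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # Piecewise linear, non-differentiable only at x = 0; in practice
    # frameworks simply use a subgradient (0 or 1) at that single point.
    return np.maximum(0.0, x)
```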

254.See Matthew Botvinick’s comments in this podcast: “I consider the networks we use in deep learning research to be a reasonable approximation to the mechanisms that carry information in the brain…If you go back to the 1980s, there’s an unbroken chain of research in which a particular strategy is taken, which is: hey, let’s train a deep learning system, let’s train a multi-layer neural network, on this task that we trained our rat on, or our monkey on, or this human being on, and let’s look at what the units deep in the system are doing, and let’s ask whether what they’re doing resembles what we know about what neurons deep in the brain are doing; and over and over and over and over and over, that strategy works, in the sense that, the learning algorithms that we have access to, which typically center on backpropagation, they give rise to patterns of activity, patterns of response, patterns of neuronal behavior in these artificial models, that look hauntingly similar to what you see in the brain. Is that a coincidence? … the circumstantial evidence is overwhelming” (see 53:00-1:00:00 here).

255.From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Paul Christiano, (p. 6).

256.Sandberg (2013): “The noise level in the nervous system is fairly high, with spike-timing variability reaching milliseconds due to ion channel noise. Perceptual thresholds and motor precision are noise limited. Various noise management solutions such as redundant codes, averaging and bias have evolved (Faisal et al. (2008)). In synapses the presynaptic transient influx of calcium ions as a response to an action potential corresponds to just 13,000 ions (Koch (1999)) (p. 458), and on the postsynaptic side just 250 ions (Koch (1999))(p. 302). These numbers are so small that numeric noise begins to be significant, and the chemical dynamics can no longer be described as average concentrations. However, biological systems can resist the discretization noise through error correction mechanisms that lead to discrete attractor dynamics, in line with the evidence that synaptic plasticity involve discrete changes rather than graded response (Ajay and Bhalla (2006) Bhalla (2004) and Elliott (2011)). It is hence not implausible that there exist sufficient scale separation on the synaptic and neuronal level: information is transmitted in a discrete code (with a possible exception of timing) between discrete entities. At finer resolution thermal and chemical noise will be significant, suggesting that evolution would have promoted error correction and hence scale separation” (p. 261). See also Open Philanthropy’s non-verbatim notes from a conversation with Prof. Konrad Kording: “If you want upper bounds on required compute, you can look at the parts list of the computing elements in the brain, the noisiness of which will put physical limits on the amount of computation they can do. This might result in very high estimates. For example, it might say that every ion channel does a bit roughly every ten milliseconds. This approach doesn’t necessarily rule out molecules and proteins as possible avenues of computation. However, some molecules may equilibrate so fast that you can replace them with a variable that describes their average state (e.g., mean field theory is applicable). You can’t do this across a neuron: there are NMDA spikes and other complexities. So the question is: what is the compartment size where local averaging is possible? People disagree. Some think the brain has organized itself to be mean-field modelable, but they have never shown much evidence for that. Still, at some length-scale (say, ten micrometers) and some time-scale (much faster than electrophysiology), everything will equilibrate” (p. 4).

257.Gerstner and Naud (2009): “Opinions strongly diverge on what constitutes a good model of a neuron” (p. 379). Herz et al. (2006): “Even today, it remains unclear which level of single-cell modeling is appropriate to understand the dynamics and computations carried out by such large systems” (p. 83-4). Kriegeskorte (2015): “Opinions diverge as to whether more biologically detailed models will ultimately be needed” (see section: “What is meant by the term neural network?”). Gabriel Kreiman, in this talk (8:00): “What’s the exact resolution at which we should study neural systems is a fundamental open question, we don’t know what’s the right level of abstraction. There are people who think about brains in the context of blood flow and millions and millions of neurons averaged together. There are people who think we need to actually pay attention to the exact details of how every single dendrite integrates information and so on. For many of us this is a sufficient level of abstraction, the notion that there is a neuron that can integrate information.” Dayan and Abbott (2001): “It is often difficult to identify the appropriate level of modeling for a particular problem” (p. xiii). See also Open Philanthropy’s non-verbatim notes from a conversation with Prof. E.J. Chichilnisky: “Discussion of the compute sufficient to replicate the brain’s information-processing is very speculative. We don’t know enough about the brain to give answers with confidence, and different people with neuroscientific expertise will answer differently” (p. 1); from Open Philanthropy’s non-verbatim notes from a conversation with Prof. Barak Pearlmutter: “Mr. Carlsmith asked Prof. Pearlmutter about his views about the level of modeling detail necessary to create brain models that can replicate task performance. Prof. Pearlmutter suggested that “the truth is: we don’t know,” and that while we may have intuitions, science has shown us that intuitions are not very reliable” (p. 1). See also Open Philanthropy’s non-verbatim notes from a conversation with Prof. Konrad Kording, Prof. Eric Jonas, and Prof. Erik De Schutter.

258.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Erik De Schutter: “Modeling neural networks at the level of simple spiking neuron models or rate-based models is very popular. Prof. De Schutter thinks the field would benefit from a greater diversity of approaches” (p. 2); from Open Philanthropy’s non-verbatim notes from a conversation with Prof. Shaul Druckmann: “The field has basically given up on detailed biophysical modeling. In the 1990s, there were many papers in top journals on the topic, but now there are almost none. Prof. Druckmann expects that the large majority of people who do not work in early sensory systems would say that detailed biophysical modeling is unnecessary for understanding the brain’s computation” (p. 7).

259.Herz et al. (2006): “The appropriate level of description depends on the particular goal of the model. Indeed, finding the best abstraction level is often the key to success” (p. 80). Pozzorini et al. (2015): “Detailed biophysical models with stochastic ion channel dynamics can in principle account for every aspect of single-neuron activity; however, due to their complexity, they require high computational power… Overall, a reliable and efficient fitting procedure for detailed biophysical models is not known” (p. 2). Izhikevich (2004): “The [Hodgkin-Huxley] model is extremely expensive to implement… one can use the Hodgkin–Huxley formalism only to simulate a small number of neurons or when simulation time is not an issue” (p. 1069). Dayan and Abbott (2001): “A frequent mistake is to assume that a more detailed model is necessarily superior. Because models act as bridges between levels of understanding, they must be detailed enough to make contact with the lower level yet simple enough to provide clear results at the higher level” (p. xiii). Beniaguev et al. (2019): “Simulation of compartmental models entails numerically solving thousands of coupled nonlinear differential equations which is computationally intensive (Segev and Rall (1998); Burke (2000)). Moreover, while the simulation provides good fit to data, it is not optimized for providing conceptual understanding of the process by which it is achieved” (p. 14). Kobayashi et al. (2009): “It has recently become possible to use elaborate simulation platforms, such as NEURON (Hines and Carnevale (1997)) and GENESIS (Bower and Beeman (1995)), for reproducing experimental data. Because of nonlinearity and complexity, however, parameter optimization of the HH type models is a notoriously difficult problem (Achard and De Schutter (2006); Goldman et al. (2001); Huys et al. (2006)), and these models require a high computational cost, which hinders performing the simulation of a massively interconnected network” (p. 1).

260.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Erik De Schutter: “The best way forward is to try to explore and understand the function of the brain’s underlying mechanisms – a project that may eventually lead to an understanding of what can be simplified. But to try to simplify things too early, before you understand them, is a dangerous game” (p. 1); from Open Philanthropy’s non-verbatim notes from a conversation with Dr. Stephen Larson: “OpenWorm’s approach is to throw as much complexity into the neuron models as they think is necessary (this is currently roughly at the level of a Hodgkin-Huxley model, plus some additional features), in an effort to really nail down that their model is capturing the worm’s behavior across many conditions and timescales. Success in such a project would allow you to bound the complexity necessary for such a simulation (indeed, this is one of Dr. Larson’s motivations for working on it). After that, you could attempt to simplify the model in a principled way. However, the jury is still out on how much simplification is available, and Dr. Larson thinks that in this kind of uncertain context, you should focus on the worst-case, most conservative compute estimates as your default” (p. 2).

261.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Anthony Zador (p. 1-2):

Prof. Zador believes that integrate-and-fire neuron models, or something like them, are adequate to capture the contribution of a neuron to the brain’s information-processing. He does not think that Hodgkin-Huxley-type models are required, or that we need to include the details of synaptic conductances in our models. However, he believes that the temporal dynamics of spiking are important. That is, it matters that there are discrete spikes, occurring at particular moments in time, which are the conduit of information between neurons…That said, he does not think that the nuances of how these spikes are generated matter very much. The integrate and fire model is one mathematically tractable model, but there are others which, if more mathematically tractable, would be fine as well.

From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Dong Song (p. 1-2):

In his view, to replicate intelligence at a level similar to humans (as opposed to some more detailed level of simulation accuracy), you don’t need to model quantum phenomena, or ionic channels, or even Hodgkin-Huxley-level dynamics. Rather, a spiking neuron model, with a rich array of input-output behavior, is sufficient. That said, certain simplified spiking neuron models are probably not sufficient. These include linear integrate-and-fire neurons, the Izhikevich model (a simplified version of the Hodgkin-Huxley model), and the models used in Prof. Song’s MIMO model.

Prof. Chris Eliasmith, whose large-scale brain model SPAUN uses leaky-integrate-and-fire neurons (see p. 16 here), thought such neuron models likely adequate for task-performance (see Open Philanthropy’s non-verbatim notes from a conversation with Prof. Chris Eliasmith (p. 5)):

Prof. Eliasmith thinks that neuron models at roughly the level of detail he uses in SPAUN (possibly including some non-linearities in the dendrites), if scaled up to the size of the brain as a whole, would be able not just to replicate cognitive performance, but also to reflect a functional profile similar to biological neurons.

From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Markus Meister (p. 1-4):

The computations performed in the retina are fairly well-understood… If your goal is to predict the spiking outputs of the retina, you don’t need a highly intricate model (for example, you don’t have to simulate the details of every neuron using multi-compartmental models). Rather, you can use very compact models known as “point neuron models,” which you can connect together with simple synapses.… To create a functional model of the whole retina, in the extreme case you’d need a point-neuron model for every cell. However, you can probably get away with less than that, because there are a lot of regularities that can be simplified computationally.… Prof. Meister would be sympathetic to scaling up from the retina as a way of putting an upper limit on the difficulty of simulating the brain as a whole. Prof. Meister has not actually done this back-of-the-envelope calculation, but budgeting based on the rate at which action potentials arrive at synapses, multiplied by the number of synapses, seems like roughly the right approach. … There is evidence that single point neuron models are not sufficient to explain all neural phenomena. For example, in cortical pyramidal cells, the basal dendrites and soma operate with different dynamics than the apical tuft. Using two point-neuron models (one for the soma, and another for the apical tuft), you can capture this fairly well. These are more powerful models, but they are not dramatically more computationally complex: e.g., it’s basically a factor of two.

From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Stephen Baccus (p. 5):

To build a functional computational model of the retina as a whole, you could use a linear filter and a threshold as a model unit, and you could have something like one model unit per cell in the retina. However, in some of Prof. Baccus’s models, they have less than this. Whether you’d need e.g. one model unit for every interneuron, or one for every two or three interneurons, isn’t clear, but it’s around that order of magnitude. Prof. Baccus does not think simulating more complex aspects of neuron biology, like dendrites, compartments and ion channels, would be necessary for replicating the retina’s input-output relationship…Prof. Baccus thinks the answer is “maybe” to the question of whether the compute necessary to model neurons in the retina will be similar to the compute necessary to model neurons in the cortex. You might expect a volume by volume comparison to work as a method of scaling up from the retina to the cortex.

Dr. Adam Marblestone offered an estimate that seemed to assume that firing decisions would be in the noise. From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Adam Marblestone (p. 9):

Dr. Marblestone is fairly comfortable with one FLOP per spike through synapse as a low-end estimate, and ~100 FLOPs per spike through synapse (roughly comparable to the estimate offered by Prof. Rahul Sarpeshkar) as a high-end estimate. His best guess is 10-100 FLOPs per spike through synapse.

Prof. Barak Pearlmutter said something similar, and he was sympathetic to the idea that dendritic computation would add only a small constant factor. From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Barak Pearlmutter (p. 2-4):

Prof. Pearlmutter thought that the compute for firing decisions would be “in the noise” relative to compute for spikes through synapses, because there are so many fewer neurons than synapses… Prof. Pearlmutter thought it a fairly good intuition that dendritic computation would only implicate a small constant factor increase in required compute, though very complicated local interactions could introduce uncertainty… Overall, Prof. Pearlmutter thought that an estimate based on 100 FLOPs per spike through synapse, with a factor of two for learning, sounded fairly reasonable.

262.A number of experts we engaged with indicated that many in the field are sympathetic to the adequacy of models less compute-intensive than single-compartment Hodgkin-Huxley (though we have very few comments in this respect publicly documented), and it fits with my impressions more broadly. See also Open Philanthropy’s non-verbatim notes from a conversation with Prof. Shaul Druckmann: “The field has basically given up on detailed biophysical modeling. In the 1990s, there were many papers in top journals on the topic, but now there are almost none. Prof. Druckmann expects that the large majority of people who do not work in early sensory systems would say that detailed biophysical modeling is unnecessary for understanding the brain’s computation” (p. 7) (though whether Hodgkin-Huxley would fall under “detailed” biophysical modeling isn’t totally clear to me).

263.Jonathan Pillow says in a lecture: “Obviously if I simulate the entire brain using multi-compartment Hodkin-Huxley models that describe the opening and closing of every channel, clearly that model has the capacity to do anything that the brain can do” (16:10). Pozzorini et al. (2015) write: “Detailed biophysical models with stochastic ion channel dynamics can in principle account for every aspect of single-neuron activity” (p. 2). Beniaguev et al. (2019): “Thanks to the introduction of compartmental models (Rall (1964)) and digital anatomical reconstructions, we can now account for nearly all those experimental phenomena, as well as explore conditions that are not accessible with current experimental technique. In that sense we have developed along the last 50 or so years a faithful model of the input-output transformation of neurons” (p. 14).

264.Workshop participants included: John Fiala, Robin Hanson, Kenneth Jeffrey Hayworth, Todd Huffman, Eugene Leitl, Bruce McCormick, Ralph Merkle, Toby Ord, Peter Passaro, Nick Shackel, Randall A. Koene, Robert A. Freitas Jr and Rebecca Roache. From a brief google, a number of these people appear to be involved in the Brain Preservation Foundation, and some (such as Toby Ord and Rebecca Roache) are philosophers rather than neuroscientists. Sandberg and Bostrom (2008): “An informal poll among workshop attendees produced a range of estimates of the required resolution for WBE… The consensus appeared to be level 4‐6. Two participants were more optimistic about high level models, while two suggested that elements on level 8‐9 may be necessary at least initially (but that the bulk of mature emulation, once the basics were understood, could occur on level 4‐5)” (p. 14).

265.From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Stephen Larson (p. 5):

On the basis of his experience at OpenWorm thus far, Dr. Larson thinks it unlikely that very simplified neuron models (e.g., integrate-and-fire neurons, or models akin to the artificial neurons used in deep neural networks) are going to be sufficient to describe the information-processing dynamics involved in the worm’s behavior…. Dr. Larson does not think that there is strong evidence that spikes and synaptic inputs are the most informative processes for studying information-processing in the brain… Given the many uncertainties involved in estimates of this kind, Dr. Larson believes that the right conclusion is something like: there is insufficient evidence to justify concluding anything (as opposed to, e.g., “there is some moderate evidence in favor of X FLOP/s, so maybe let’s believe that?”). In statistics, for example, one wants a P value less than 0.05, and Dr. Larson is not sure we have anything like that for these FLOP/s estimates.

From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Erik De Schutter:

Prof. De Schutter thinks that at this point, we simply are not in a position to place any limits on the level of biological detail that might be relevant to replicating the brain’s task-performance. Many common simplifications do not have solid scientific foundations, and are more at the level of ‘the way we do things.’

From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Eric Jonas (p. 6):

many electrophysiologists would say that we don’t know what neurons are doing. And they would ask: how can we start making claims about the computational capacity of networks of neurons, if we don’t know how individual neurons work? Prof. Jonas is sympathetic to this. There are a variety of complexities that make the computations performed by a neuron extremely difficult to quantify. Examples include: dendritic spiking, the complex dynamics present in synapses (including large numbers of non-linearities), the diversity of ion-channel receptors, post-translational modification, alternative splicing, and various receptor trafficking regimes. Some people attempt to draw comparisons between neurons and transistors. However, even with a billion transistors, Prof. Jonas does not know how to create a reasonable simulation of a neuron.

From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Konrad Kording (p. 4):

Examination of neurons reveals that they are actually very non-linear, and the computations involved in plasticity probably include a large number of factors distributed across the cell. In this sense, a neuron might be equivalent to a three-layer neural network, internally trained using backpropagation. In that case, you’d need to add another factor of roughly 10^5 to your compute estimate, for a total of 10^20 multiplications per second. This would be much less manageable… The difference between the estimates generated by these different approaches is very large – something like ten orders of magnitude. It’s unclear where the brain is on that spectrum … Prof. Kording’s hunch is that in order to replicate firing decisions in neurons, you’d need to break the neuron into pieces of something like ten microns (this would be hundreds, maybe thousands, of compartments per neuron). This hunch is grounded in a belief that neurons are very non-linear.

From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Shaul Druckmann (p. 3):

We can distinguish between two approaches to the brain’s biophysical complexity. One camp argues: ‘let’s not assume we need to include a given type of biophysical complexity in our models, until doing so becomes necessary.’ The other argues: ‘If this complexity were in fact important, we would not currently be able to tell.’ Prof. Druckmann tends to be in this latter camp, though he thinks that the former is a fair and practical approach.

Though note that:

Prof. Druckmann would be extremely surprised if future working models of human intelligence incorporate large amounts of biophysical detail (e.g., molecular dynamics). He is confident that the type of non-linearities generated by real biophysics can be more efficiently emulated in different ways in a model. Therefore, these models will look more like giant networks of simple artificial neurons than giant networks of Hodgkin-Huxley models.

266.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Erik De Schutter:

Many common simplifications do not have solid scientific foundations, and are more at the level of ‘the way we do things.’

From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Konrad Kording (p. 5):

In general, people are often willing to take a philosophical position, without much evidence, if it makes their research more important.

From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Anthony Zador (p. 5):

Prof. Zador’s views about the relative importance of different neural mechanisms are shaped centrally by gut feeling and scientific aesthetic. Neuroscientists have debated this issue for decades, and ultimately the proof is in the pudding. Prof. Zador expects that a lot of neuroscientists would say that we just don’t know what amount of compute would be required to match human-level task performance. There is also a wide diversity of views in the field, and many people’s views are centrally shaped by their research background. For example, people with backgrounds in biology are generally more excited about incorporating biological detail; people who study humans tend to focus on the importance of learning; and people who study small animals like C. elegans or fruit flies focus less on learning and more on innate behaviors.

From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Dong Song (p. 2):

It would be hard for Prof. Song to prove his view.

From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Barak Pearlmutter (p. 1):

Prof. Pearlmutter suggested that ‘the truth is: we don’t know,’ and that while we may have intuitions, science has shown us that intuitions are not very reliable.

From Open Philanthropy’s non-verbatim notes from a conversation with Prof. E.J. Chichilnisky (p. 2):

no one has been able to prove one way or another whether detailed biophysical modeling is necessary. It’s hard to know, and there isn’t a lot of evidence. There are high-quality experimental and computational efforts underway to understand this…People’s views about the right level of biophysical detail to focus on are sometimes shaped by what they’re good at (e.g., computational simplifications, vs. detailed biophysical analysis). And some people just find biophysical complexity intrinsically interesting.

From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Shaul Druckmann (p. 6):

Prof. Druckmann believes that at our current conceptual understanding of neural computation, many statements in neuroscience to the effect that “we can reduce X to Y” are based mostly on personal opinion, sometimes influenced in part by what current technology allows us to do, rather than in well-justified, first-principles reasoning.

267.From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Paul Christiano: “A ReLU costs less than a FLOP. Indeed, it can be performed with many fewer transistors than a multiply of equivalent precision” (p. 6).

268.This number is just a ballpark for lower temporal resolutions. For example, it’s the resolution used by Maheswaranathan et al. (2019).

269.Izhikevich (2004), (p. 1068).

270.Izhikevich (2004) seems to be assuming at least 1000 time-steps per second: “It takes only 13 floating point operations to simulate 1 ms of the model, so it is quite efficient in large-scale simulations of cortical networks. When (a,b,c,d) = (0.2, 2, -56, -16) and I = -99, the model has chaotic spiking activity, though the integration time step [here Izhikevich uses a symbol that google doc endnotes can’t reproduce] should be small to achieve adequate numerical precision” (p. 1068).
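
For concreteness, here is a minimal sketch of the two-variable model Izhikevich is describing, stepped with forward Euler (the default parameter values below are the commonly used regular-spiking settings, not values from this endnote). With a 1 ms step, one update costs roughly the 13 floating point operations he counts:

```python
def izhikevich_step(v, u, I, a=0.02, b=0.2, c=-65.0, d=8.0, dt=1.0):
    # Membrane potential: dv/dt = 0.04*v^2 + 5*v + 140 - u + I
    v = v + dt * (0.04 * v * v + 5.0 * v + 140.0 - u + I)
    # Recovery variable: du/dt = a*(b*v - u)
    u = u + dt * a * (b * v - u)
    if v >= 30.0:  # spike: reset v, bump the recovery variable
        v, u = c, u + d
    return v, u
```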

271.Izhikevich (2004), (p. 1069).

272.The FLOPs estimate for the Hodgkin-Huxley model given in Izhikevich (2004) appears to assume at least 10,000 timesteps/sec: “It takes 120 floating point operations to evaluate 0.1 ms of model time (assuming that each exponent takes only ten operations), hence, 1200 operations/1 ms” (p. 1069). I’m not entirely confident that the “.1 ms of model time” Izhikevich is referring to corresponds with a .1 ms time-step, but this fits with his characterization of the model as consisting of tens of parameters and requiring at least 10 FLOPs for each exponent. And regardless, it seems unlikely that he has time-steps larger than .1 ms in mind, given that he budgets based on .1 ms increments.

273.Here’s my estimate, which the lead author of the paper tells me looks about right. 1st layer: 1278 synaptic inputs × 35 × 128 = 5.7 million MACCs (from line 140 and lines 179-180 here); Next 6 layers: 6 layers × 128 × 35 × 128 = 3.4 million MACCs. Total per ms: ~ 10 million MACCs. Total per second: ~10 billion MACCs. Multiplied by 2 to count individual FLOPs (see “It’s dot products all the way down” here) = ~20 billion FLOP/s per cell. Though the authors also note that “the accuracy of the model was insensitive to the temporal kernel sizes of the different DNN layers when keeping the total temporal extent of the entire network fixed, so the temporal extent of the first layer was selected to be larger than subsequent layers mainly for visualization purposes” (p. 7). I’m not sure what kind of difference this might make.
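
The arithmetic above can be reproduced in a few lines (layer sizes as given in the endnote; the variable names are just for illustration):

```python
inputs, kernel, channels, layers = 1278, 35, 128, 7

first_layer = inputs * kernel * channels                    # ~5.7e6 MACCs per ms
later_layers = (layers - 1) * channels * kernel * channels  # ~3.4e6 MACCs per ms
maccs_per_second = (first_layer + later_layers) * 1000      # ~9.2e9 MACCs per second
flops_per_second = 2 * maccs_per_second                     # ~1.8e10, i.e. ~20 billion FLOP/s per cell
```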

274.This is a very loose estimate, based on scaling up the estimate for the Beniaguev et al. (2020) DNN by ~1000x, on the basis of their reporting, in the 2019 version of the paper, that “In our tests we obtained a factor of ~2000 speed up when using the DNN instead of its compartmental-model counterpart” (p. 15). In the current paper they report “a speedup of simulation time by several orders of magnitude” (p. 8).

275.This is somewhat analogous to the approach taken by Ananthanarayanan et al. (2009): “The basic algorithm of our cortical simulator C2 [2] is that neurons are simulated in a clock-driven fashion whereas synapses are simulated in an event-driven fashion. For every neuron, at every simulation time step (say 1 ms), we update the state of each neuron, and if the neuron fires, generate an event for each synapse that the neuron is post-synaptic to and presynaptic to. For every synapse, when it receives a pre- or post-synaptic event, we update its state and, if necessary, the state of the post-synaptic neuron” (p. 3, Section 3).

276.From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Paul Christiano: “Dr. Christiano expects that in modeling a neuron’s input-output function, one would not need to compute, every time-step, whether or not the neuron fires during that time-step. Rather, you could accumulate information about the inputs to a neuron over a longer period, and then compute the timing of its spikes over that period all at once. This definitely holds in a purely feedforward context – e.g., for a given neuron, you could simply compute all of the times that the neuron fires, and then use this information to compute when all of the downstream neurons fire, and so on. The fact that the brain’s architecture is highly recurrent complicates this picture, as the firing pattern of a particular neuron may be able to influence the inputs that that same neuron receives. However, the time it takes for an action potential to propagate would be a lower bound on how long it would be possible to wait in accumulating synaptic inputs (since the timescale of a neuron’s influence on its own inputs is capped by the propagation time of its outgoing signals)” (p. 6).

277.Sarpeshkar (2010) employs what appears to be a single-compartment Hodgkin-Huxley model of firing decisions as a lower bound (he cites Izhikevich (2004), and uses an estimate of 1200 FLOPs per firing decision – the number that Izhikevich gives for running a Hodgkin-Huxley model for one ms (see p. 1066)), but he assumes that the model only needs to be “run” every time a neuron spikes (he uses a 5 Hz average rate) (p. 747-8). My intuition, though, would’ve been that because you do not know ahead of time whether or not the synaptic inputs are sufficient to cause an action potential, you would need to calculate this more often than spiking actually occurs.
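
To make the difference at stake concrete, here is the rough arithmetic under the assumptions just described (Izhikevich’s 1200 FLOPs per ms of Hodgkin-Huxley model time, and Sarpeshkar’s 5 Hz average rate); the comparison is a sketch, not a figure from either source:

```python
flops_per_ms_of_model_time = 1200  # Izhikevich's Hodgkin-Huxley figure
spike_rate_hz = 5                  # Sarpeshkar's assumed average firing rate

# Run the 1 ms firing-decision model only when a spike occurs:
spike_driven = flops_per_ms_of_model_time * spike_rate_hz  # 6e3 FLOP/s per neuron
# Run it continuously, every millisecond of simulated time:
clock_driven = flops_per_ms_of_model_time * 1000           # 1.2e6 FLOP/s per neuron
```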

278.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Eve Marder: “the computational power necessary to run e.g. a full Hodgkin-Huxley model depends a lot on implementation: e.g., what platform you use, what language you’re using, what method of integration, and what time-step for integration (all of your compute time goes to integrations)” (p. 4-5).

279.See Hansel et al. (1998): “It is shown that very small time steps are required to reproduce correctly the synchronization properties of large networks of integrate-and-fire neurons when the differential system describing their dynamics is integrated with the standard Euler or second-order Runge-Kutta algorithms” (p. 467) … “An integration time step of t = 0.001 ms is actually required to evaluate correctly the coherence of the network in this regime” (p. ). Thanks to the expert who pointed me to this paper.
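
As an illustration of why the integration time-step dominates cost, here is a minimal forward-Euler update for a leaky integrate-and-fire neuron (parameter values are illustrative, not from Hansel et al.). Per-neuron FLOP/s scales as 1/dt, so moving from a 0.1 ms step to the 0.001 ms step they report is a 100x increase:

```python
def lif_euler_step(v, I, dt, tau_m=20.0, v_rest=-65.0,
                   v_thresh=-50.0, v_reset=-65.0):
    # Leaky integrate-and-fire dynamics: dv/dt = ((v_rest - v) + I) / tau_m,
    # integrated with a single forward-Euler step of size dt (in ms).
    v = v + dt * ((v_rest - v) + I) / tau_m
    if v >= v_thresh:        # threshold crossing: emit a spike and reset
        return v_reset, True
    return v, False
```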

280.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Chris Eliasmith: “Prof. Eliasmith typically uses 1 ms time-steps in the simulations he builds” (p. 3); and Eliasmith et al. (2012) use leaky integrate-and-fire models (see p. 16 of the supplementary materials). Izhikevich (2004) reports various types of collective neuron behavior in simulations using his 13 FLOP/ms model at 1 ms resolution, and others for a different simulation at 0.5 ms for neuron simulation and 1 ms for synaptic dynamics (see Izhikevich et al. (2004), “Neuronal Dynamics”). Ananthanarayanan et al. (2009) use 0.1-1 ms (see p. 3, Section 3.1.1) for “single-compartment phenomenological spiking neurons” (they cite Izhikevich et al. (2004), which suggests to me that they are using Izhikevich models as well).

281.It’s based on scaling up the estimate for the Beniaguev et al. (2020) DNN by ~1000x, on the basis of their reporting, in the 2019 version of the paper, that “In our tests we obtained a factor of ~2000 speed up when using the DNN instead of its compartmental-model counterpart” (p. 15). In the current paper they report “a speedup of simulation time by several orders of magnitude” (p. 8).

282.From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Adam Marblestone: “It might be that all of the neurons and synapses in the brain are there in order to make the brain more likely to converge on a solution while learning, but that once learning has taken place, the brain implements a function that can be adequately approximated using much less compute” (p. 7).

283.Tsodyks and Wu (2013): “Compared with long-term plasticity (Bi and Poo (2001)), which is hypothesized as the neural substrate for experience-dependent modification of neural circuit, STP has a shorter time scale, typically on the order of hundreds to thousands of milliseconds.” See also Ghanbari et al. (2017), (p. 1), Bliss and Lømo (1973), and Citri and Malenka (2008). It is also possible to break these categories down more finely. Clopath (2012), for example, writes: “A change in synaptic strength can last for different lengths of time: we speak about short-term plasticity when the change lasts up to a few minutes, early-long-term plasticity when it lasts up to a few hours and late-long-term plasticity when it lasts beyond the experiment’s duration (which is often about 10 h) but is thought to last much longer even, possibly a life-time. This last type of plasticity is also called synaptic consolidation or maintenance” (p. 251). Sandberg and Bostrom (2008) suggest that short-term synaptic plasticity “likely plays a role in a variety of brain functions, such as temporal filtering (Fortune and Rose (2001)), auditory processing (Macleod, Horiuchi et al. (2007)) and motor control (Nadim and Manor (2000))” (p. 32). Types of synaptic plasticity can be further subdivided according to whether the relevant change increases (“facilitation”/“potentiation”) or decreases (“depression”) the size of the post-synaptic impact of a spike through that synapse: see Tsodyks and Wu (2013) and Yang and Calakos (2013).

284.Cudmore and Desai (2008): “Intrinsic plasticity is the persistent modification of a neuron’s intrinsic electrical properties by neuronal or synaptic activity. It is mediated by changes in the expression level or biophysical properties of ion channels in the membrane, and can affect such diverse processes as synaptic integration, subthreshold signal propagation, spike generation, spike backpropagation, and meta-plasticity.” Indeed, it has been shown that a type of neuron in the cerebellum known as a Purkinje cell can learn timed responses to inputs in a manner that does not rely on synaptic plasticity. Johansson et al. (2014): “The standard view of the mechanisms underlying learning is that they involve strengthening or weakening synaptic connections. Learned response timing is thought to combine such plasticity with temporally patterned inputs to the neuron. We show here that a cerebellar Purkinje cell in a ferret can learn to respond to a specific input with a temporal pattern of activity consisting of temporally specific increases and decreases in firing over hundreds of milliseconds without a temporally patterned input. Training Purkinje cells with direct stimulation of immediate afferents, the parallel fibers, and pharmacological blocking of interneurons shows that the timing mechanism is intrinsic to the cell itself. Purkinje cells can learn to respond not only with increased or decreased firing but also with an adaptively timed activity pattern” (p. 14930).

285.See e.g. Munno and Syed (2003), Ming and Song (2011), Grutzendler et al. (2002), Holtmaat et al. (2005).

286.See e.g. Markram et al. (1997).

287.See Luscher and Malenka (2012).

288.See e.g. Gerstner et al. (2018), and Nadim and Bucher (2014).

289.See Monday et al. (2018) (p. 7-8).

290.See Tao and Poo (2001).

291.See Yap and Greenberg (2018).

292.See Bhalla (2014), Figure 1, for a diagram depicting some of this machinery.

293.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Blake Richards: “Some neuroscientists are interested in the possibility that a lot of computation is occurring via molecular processes in the brain. For example, very complex interactions could be occurring in a structure known as the post-synaptic density, which involves molecular machinery that could in principle implicate many orders of magnitude of additional compute per synapse. We don’t yet know what this molecular machinery is doing, because we aren’t yet able to track the states of the synapses and molecules with adequate precision. There is evidence that perturbing the molecular processes within the synapse alters the dynamics of synaptic plasticity, but this doesn’t necessarily provide much evidence about whether these processes are playing a computational role. For example, their primary role might just be to maintain and control a single synaptic weight, which is itself a substantive task for a biological system” (p. 2). Monday et al. (2018): “The cellular basis of learning and memory is one of the greatest unsolved mysteries in neuroscience … Despite significant advancements in the molecular basis of neurotransmission, exactly how transmitter release is modified in a long-term manner remains largely unclear” (p. 1-2).

294.Lahiri and Ganguli (2013): “To understand the functional contribution of such molecular complexity to learning and memory, it is essential to expand our theoretical conception of a synapse from a single scalar to an entire dynamical system with many internal molecular functional states” (p. 1). Benna and Fusi (2016): “The molecular machinery responsible for memory consolidation at the level of synaptic connections is believed to employ a complex network of diverse biochemical processes that operate on different timescales. Understanding how these processes are orchestrated to preserve memories over a lifetime requires guiding principles to interpret the complex organization of the observed synaptic molecular interactions and explain its computational advantage. Here we present a class of synaptic models that can efficiently harness biological complexity to store and preserve a huge number of memories on long timescales, vastly outperforming all previous synaptic models of memory” (p. 1697). Kaplanis et al. (2018): “we show that by equipping tabular and deep reinforcement learning agents with a synaptic model that incorporates this biological complexity (Benna and Fusi (2016)), catastrophic forgetting can be mitigated at multiple timescales. In particular, we find that as well as enabling continual learning across sequential training of two simple tasks, it can also be used to overcome within-task forgetting by reducing the need for an experience replay database” (p. 1). Zenke et al. (2017): “In this study, we introduce intelligent synapses that bring some of this biological complexity into artificial neural networks. Each synapse accumulates task relevant information over time, and exploits this information to rapidly store new memories without forgetting old ones. We evaluate our approach on continual learning of classification tasks, and show that it dramatically reduces forgetting while maintaining computational efficiency” (abstract).

295.Activity-dependent myelination might be one example (see e.g. Faria et al. (2019)).

296.Though short-term plasticity is both (a) fairly fast and (b) possibly involved in working memory, which many tasks require. See also Sandberg and Bostrom (2008): “Since neurogenesis occurs on fairly slow timescales (> 1 week) compared to brain activity and normal plasticity, it could probably be ignored in brain emulation if the goal is an emulation that is intended to function faithfully for only a few days and not to exhibit truly long‐term memory consolidation or adaptation” (p. 35).

297.Sorrells et al. (2018): “In humans, some studies have suggested that hundreds of new neurons are added to the adult dentate gyrus every day, whereas other studies find many fewer putative new neurons.” See also Moreno-Jimenez et al. (2019): “we identified thousands of immature neurons in the DG of neurologically healthy human subjects up to the ninth decade of life” (abstract).

298.Zuo et al. (2005): “In adult mice (4-6 months old), 3%-5% of spines were eliminated and formed over 2 weeks in various cortical regions. Over 18 months, only 26% of spines were eliminated and 19% formed in adult barrel cortex” (from the abstract). From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Erik De Schutter: “Networks of neurons can rewire themselves fairly quickly, over timescales of tens of minutes. These changes correlate with improvements in performance on tasks” (p. 3).

299.Dr. Dario Amodei suggested considerations in this vein.

300.See e.g. this diagram of a potentiated synapse, illustrating an increased number of post-synaptic receptors.

301.Thus, for example, Bliss and Lømo (1973), in an early result related to long-lasting synaptic potentiation, use conditioning spike trains of 10-15 secs, and 3-4 seconds (p. 331).

302.See discussion of the “stability – plasticity dilemma,” e.g. Mermillod et al. (2013). One possible solution is to use multiple dynamical variables operating on different timescales – see Benna and Fusi (2016).
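
To give a feel for what “multiple dynamical variables operating on different timescales” can look like at a single synapse, here is a minimal Euler-step sketch in the spirit of Benna and Fusi’s chain model (the variable names, and the simplified coupling scheme, are illustrative rather than the paper’s exact formulation):

```python
import numpy as np

def synapse_chain_step(u, dt, g, C):
    # u[0] is the visible synaptic weight; deeper variables in the chain
    # change ever more slowly (larger C), storing memories at longer
    # timescales. g holds the couplings between adjacent variables.
    du = np.zeros_like(u)
    for k in range(len(u)):
        if k > 0:
            du[k] += g[k - 1] * (u[k - 1] - u[k])
        if k < len(u) - 1:
            du[k] += g[k] * (u[k + 1] - u[k])
    return u + dt * du / C
```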

303.Koch (1999): “An important distinction between ionotropic and metabotropic receptors is their time scale. While members of the former class act rapidly, terminating within a very small fraction of a second, the speed of the latter class is limited by diffusion. Biochemical reactions can happen nearly instantaneously at the neuronal time scale. However, if a synaptic input to a metabotropic receptor induces the release of some messenger, such as calcium ions, which have to diffuse to the cell body in order to ‘do their thing,’ the time scale is extended to seconds or longer” (p. 95). See also Siegelbaum et al. (2013b): “whereas the action of ionotropic receptors is fast and brief, metabotropic receptors produce effects that begin slowly and persist for long periods, ranging from hundreds of milliseconds to many minutes” (p. 236).

304.See p. 32. Bhalla (2014) also suggests that chemical computation involves 1e6 “computations per second” per neuron.

305.Yap and Greenberg (2018): “Discovered by Greenberg and Ziff in 1984 (Greenberg and Ziff (1984)), the rapid and transient induction of Fos transcription provided the first evidence that mammalian cells could respond to the outside world within minutes by means of rapid gene transcription, in particular through the activation of specific genes (Cochran et al. (1984); Greenberg et al. (1985); Greenberg et al. (1986); Kruijer et al. (1984); Lau and Nathans (1987); Müller et al. (1984))” (p. 331).

306.Indeed, certain models of synaptic plasticity explicitly include variables whose state is not immediately expressed in changes to synaptic efficacy (that is, in the size of the effect that a spike through that synapse has on a downstream neuron). See e.g. three-factor learning rules discussed by Gerstner et al. (2018). From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Adam Marblestone: “Compute increases are more likely to come from synaptic decisions that get computed on something like a per-spike basis. For example, you might need to do a lot of fast computation in order to set the synaptic “flag” variables involved in some neo-Hebbian three-factor learning rules, even if these variables take a long time to have effects” (p. 3).

307.Tsodyks and Wu (2013): “Compared with long-term plasticity (Bi and Poo (2001)), which is hypothesized as the neural substrate for experience-dependent modification of neural circuit, STP has a shorter time scale, typically on the order of hundreds to thousands of milliseconds.” Cheng et al. (2018): “It is well established that both augmentation and potentiation are triggered by a transient rise in calcium concentration within the presynaptic terminal.”

308.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Blake Richards: “it is very difficult to say at this point exactly how much compute would be required to model learning in the brain, because there is a lot of disagreement in the field as to how sophisticated the learning algorithms in the brain are. This is partly because we don’t have a good hold on how much human learning is truly general purpose, vs. constrained to particular tasks” (p. 1).

309.See Yann LeCun’s 2017 talk: “How does the brain learn so much so quickly?”, and Stuart Russell’s comments here: “I think another area where deep learning is clearly not capturing the human capacity for learning, is just in the efficiency of learning. I remember in the mid ’80s going to some classes in psychology at Stanford, and there were people doing machine learning then and they were very proud of their results, and somebody asked Gordon Bower, “how many examples do humans need to learn this kind of thing?” And Gordon said “one [sic] Sometimes two, usually one”, and this is genuinely true, right? If you look for a picture book that has one to two million pictures of giraffes to teach children what a giraffe is, you won’t find one. Picture books that tell children what giraffes are have one picture of a giraffe, one picture of an elephant, and the child gets it immediately, even though it’s a very crude cartoonish drawing, of a giraffe or an elephant, they never have a problem recognizing giraffes and elephants for the rest of their lives. Deep learning systems are needing, even for these relatively simple concepts, thousands, tens of thousands, millions of examples, and the idea within deep learning seems to be that well, the way we’re going to scale up to more complicated things like learning how to write an email to ask for a job, is that we’ll just have billions or trillions of examples, and then we’ll be able to learn really, really complicated concepts. But of course the universe just doesn’t contain enough data for the machine to learn direct mappings from perceptual inputs or really actually perceptual input history. So imagine your entire video record of your life, and that feeds into the decision about what to do next, and you have to learn that mapping as a supervised learning problem. It’s not even funny how unfeasible that is. The longer the deep learning community persists in this, the worse the pain is going to be when their heads bang into the wall.” That said, work on this topic is ongoing, and these comparisons don’t seem straightforward.

310.See e.g., Guerguiev et al. (2017), Bartunov et al. (2018), and Hinton (2011). From Guerguiev et al. (2017): “Backpropagation assigns credit by explicitly using current downstream synaptic connections to calculate synaptic weight updates in earlier layers, commonly termed ‘hidden layers’ (LeCun et al., 2015) (Figure 1B). This technique, which is sometimes referred to as ‘weight transport’, involves non-local transmission of synaptic weight information between layers of the network (Lillicrap et al. (2016); Grossberg (1987)). Weight transport is clearly unrealistic from a biological perspective (Bengio et al. (2015); Crick (1989)). It would require early sensory processing areas (e.g. V1, V2, V4) to have precise information about billions of synaptic connections in downstream circuits (MT, IT, M2, EC, etc.). According to our current understanding, there is no physiological mechanism that could communicate this information in the brain. Some deep learning algorithms utilize purely Hebbian rules (Scellier and Bengio, 2016; Hinton et al. (2006)). But, they depend on feedback synapses that are symmetric to feedforward synapses (Scellier and Bengio, 2016; Hinton et al. (2006)), which is essentially a version of weight transport. Altogether, these artificial aspects of current deep learning solutions to credit assignment have rendered many scientists skeptical of the proposal that deep learning occurs in the real brain (Crick, 1989; Grossberg (1987); Harris (2008); Urbanczik and Senn (2009)). Recent findings have shown that these problems may be surmountable, though. Lillicrap et al. (2016), Lee et al. (2015) and Liao et al. (2015) have demonstrated that it is possible to solve the credit assignment problem even while avoiding weight transport or symmetric feedback weights” (p. 3).

311.See e.g. David Pfau via twitter: “In 100 years, we’ll look back on theories of ‘how the brain does backpropagation’ the way we look at the luminiferous aether now.” See also Open Philanthropy’s non-verbatim notes from a conversation with Prof. Eric Jonas: “Prof. Jonas does not think that there is a clear meaning to the claim that the brain is a deep learning system” (p. 3).

312.See e.g. Gerstner et al. (2018) for some descriptions. From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Adam Marblestone: “A lot of the learning models discussed in neuroscience are also significantly simpler than backpropagation: e.g., three-factor rules like “if the pre-synaptic neuron was active, and the post-synaptic neuron was active, and you had dopamine in the last ~3 seconds, then strengthen” (p. 6). From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Anthony Zador: “We know the general outlines of the rules governing synaptic plasticity. The synapse gets stronger and weaker as a function of pre and post synaptic activity, and external modulation” (p. 3).
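
A toy sketch of the kind of three-factor rule quoted above (the function name, threshold, and learning rate are all illustrative, not from the source):

```python
def three_factor_update(w, pre_active, post_active, dopamine_trace, lr=0.01):
    # Strengthen the synapse only if the pre- and post-synaptic neurons were
    # both active and a neuromodulatory signal (e.g. dopamine within the last
    # ~3 seconds, tracked here as a decaying trace) is present.
    if pre_active and post_active and dopamine_trace > 0.0:
        w = w + lr * dopamine_trace
    return w
```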

313.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Chris Eliasmith: “In the large scale brain simulations that Chris Eliasmith builds, he often uses an error-driven Hebbian rule, which computes updates to synaptic weights based on pre-synaptic activity, post-synaptic activity, and an error signal (which, in the brain, could proceed via a mechanism like dopamine modulation). This rule requires on the order of three to five operations per synapse (a couple of products, and then a weight update), though the total burden depends on how often you perform the updates” (p. 4).
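
A minimal sketch of an error-modulated Hebbian update of the sort Prof. Eliasmith describes, which makes the “three to five operations per synapse” count visible (variable names are illustrative):

```python
def error_hebbian_update(w, pre, post, error, lr=1e-3):
    # Three multiplies and one add per synapse per update, consistent with
    # the "on the order of three to five operations" figure in the notes.
    return w + lr * pre * post * error
```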

314.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Anthony Zador: “We know the general outlines of the rules governing synaptic plasticity. The synapse gets stronger and weaker as a function of pre and post synaptic activity, and external modulation. There is a lot of room for discovery there, and it may be difficult to get just right, but conceptually, it’s pretty simple. Prof. Zador expects it to be possible to capture synaptic plasticity with a small number of FLOPs per spike through synapse” (p. 3).

315.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Chris Eliasmith: “In the large scale brain simulations that Chris Eliasmith builds, he often uses an error-driven Hebbian rule, which computes updates to synaptic weights based on pre-synaptic activity, post-synaptic activity, and an error signal (which, in the brain, could proceed via a mechanism like dopamine modulation)” (p. 4).

316.Kaplanis et al. (2018) add 30 extra dynamical variables per synapse, but manage to increase runtime by only 1.5-2 times relative to a control model, though I’m not sure about the details here. They note that “the complexity of the algorithm is O(mN), where N is the number of trainable parameters in the network and m is the number of Benna-Fusi variables per parameter.”

317.See e.g. Lahiri and Ganguli (2013): “To understand the functional contribution of such molecular complexity to learning and memory, it is essential to expand our theoretical conception of a synapse from a single scalar to an entire dynamical system with many internal molecular functional states” (p. 1). Benna and Fusi (2016): “The molecular machinery responsible for memory consolidation at the level of synaptic connections is believed to employ a complex network of diverse biochemical processes that operate on different timescales. Understanding how these processes are orchestrated to preserve memories over a lifetime requires guiding principles to interpret the complex organization of the observed synaptic molecular interactions and explain its computational advantage. Here we present a class of synaptic models that can efficiently harness biological complexity to store and preserve a huge number of memories on long timescales, vastly outperforming all previous synaptic models of memory” (p. 1697). My understanding is that Fusi and Abbott (2007) is a precursor to some of this work.

318.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Blake Richards: “First-order gradient descent methods, like back-propagation, use the slope of the loss function to minimize the loss” (p. 1-2).

319.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Blake Richards: “[For first-order gradient descent methods], learning is basically a backwards pass through the network, so the compute required scales linearly with the number of neurons and synapses in the network, adding only a small constant factor” (p. 1-2). See also Open Philanthropy’s non-verbatim notes from a conversation with Prof. Barak Pearlmutter: “Prof. Pearlmutter’s best-guess estimate was that the learning overhead (that is, the compute increase from moving from a non-adaptive system to an adaptive system) would be a factor of two. It could be more or less, but this is a number we actually understand, because the existing learning algorithms that we know work for large-scale systems, and that we have put effort into optimizing – for example, backpropagation – implicate roughly this type of overhead” (p. 3).

320.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Barak Pearlmutter: “Prof. Pearlmutter’s best-guess estimate was that the learning overhead (that is, the compute increase from moving from a non-adaptive system to an adaptive system) would be a factor of two. It could be more or less, but this is a number we actually understand, because the existing learning algorithms that we know work for large-scale systems, and that we have put effort into optimizing – for example, backpropagation – implicate roughly this type of overhead” (p. 3). From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Konrad Kording: “Prof. Kording thinks that learning in the brain requires the same amount of compute as processing. If you have a compute graph, going forwards and backwards comes at roughly the same cost” (p. 3). From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Blake Richards: “Prof. Richards favors the hypothesis that the brain uses a learning method with compute scaling properties similar to backpropagation. This is partly because humans are capable of learning so many tasks that were not present in the evolutionary environment (and hence are unlikely to be hardwired into our brains), with comparatively little data (e.g., less than a weight-perturbation algorithm would require)” (p. 2).
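
For intuition about the “factor of two” figure, here is a rough FLOP count for a single dense layer under standard deep learning assumptions (a sketch, not a claim about the brain or about any cited estimate):

```python
def dense_layer_training_flops(n_in, n_out):
    forward = 2 * n_in * n_out  # one multiply-accumulate per weight
    # The backward pass needs gradients w.r.t. both the inputs and the
    # weights, each costing roughly another matrix multiply, so training is
    # a small constant multiple of inference (commonly quoted as ~2-3x).
    backward = 2 * (2 * n_in * n_out)
    return forward, backward
```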

321.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Blake Richards: “More sophisticated learning algorithms, such as second-order gradient methods, take into account not just the slope of the loss function gradient but also its curvature. These require more compute (the compute per learning step scales as a polynomial with the number of neurons and synapses), which is why people don’t use these techniques, even though they are arguably much better” (p. 2).
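
For reference, the standard complexity figures behind this contrast (my own summary, not from the notes): for a network with $N$ parameters, a first-order step touches each parameter once, whereas a naive second-order (Newton-style) step must form the Hessian and solve a linear system:

$$
\nabla L \in \mathbb{R}^{N} \;\Rightarrow\; O(N), \qquad
H = \nabla^{2} L \in \mathbb{R}^{N \times N} \;\Rightarrow\; O(N^{2})\ \text{entries}, \qquad
H\,\Delta\theta = -\nabla L \;\Rightarrow\; O(N^{3})\ \text{for a naive solve.}
$$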

322.See previous endnote.

323.From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Paul Christiano: “Based on his understanding of the brain’s physiology, Dr. Christiano thinks it extremely implausible that the brain could be implementing second-order optimization methods” (p. 7).

324.From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Adam Marblestone: “He has not seen proposals for how second-order gradient methods of learning could be implemented in the brain.” (p. 6).

325.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Blake Richards: “In the other direction, there are algorithms known as “weight-perturbation” or “node-perturbation” algorithms. These involve keeping/consolidating random changes to the network that result in reward, and getting rid of changes that result in punishment (a process akin to updating parameters based on simple signals of “hotter” and “colder”). These algorithms require less compute than first-order gradient descent methods, but they take longer to converge as the size of the network grows. In this sense, they involve trade-offs between compute and time” (p. 2).
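
As a concrete illustration of the trade-off (a sketch of my own, not code from any of the conversations), a weight-perturbation step is extremely cheap per update – try a random change, keep it if the loss improves – but the number of such steps needed grows with network size:

```python
# Minimal weight-perturbation sketch: "hotter/colder" learning.
# Cheap per step (no gradients needed), but slow to converge for large networks.
import numpy as np

rng = np.random.default_rng(0)

def perturbation_step(w, loss_fn, sigma=0.01):
    candidate = w + sigma * rng.standard_normal(w.shape)   # random tweak
    # Keep the change if it helped ("hotter"); otherwise discard it ("colder").
    return candidate if loss_fn(candidate) < loss_fn(w) else w

# Toy usage: fit w to a target vector.
target = rng.standard_normal(100)
loss = lambda w: float(np.sum((w - target) ** 2))
w = np.zeros(100)
for _ in range(5_000):
    w = perturbation_step(w, loss)
```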

326.See previous endnote.

327.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Blake Richards: “Prof. Richards favors the hypothesis that the brain uses a learning method with compute scaling properties similar to backpropagation. This is partly because humans are capable of learning so many tasks that were not present in the evolutionary environment (and hence are unlikely to be hardwired into our brains), with comparatively little data (e.g., less than a weight-perturbation algorithm would require)” (p. 2).

328.From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Adam Marblestone: “There are also non-gradient methods of learning. For example, some people are interested in Bayesian belief propagation, though Dr. Marblestone is not aware of efforts to describe how this might be implemented at the level of e.g. dendrites. We shouldn’t assume that the brain is doing some sort of gradient-based learning” (p. 6). See also Gütig and Sompolinsky (2006) (though I’m not sure if this would fall into one of the categories above).

329.From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Kate Storrs: “Dr. Storrs’ sense is that, in the parts of the field she engages with most closely (e.g., systems level modeling, visual/cognitive/perceptual modeling, human behavior), and maybe more broadly, a large majority of people treat synaptic weights as the core learned parameters in the brain. That said, she is not a neurophysiologist, and so isn’t the right person to ask about what sort of biophysical complexities could imply larger numbers of parameters. She is peripherally aware of papers suggesting that glia help store knowledge, and there are additional ideas as well. The truth probably involves mechanisms other than synaptic weights, but she believes that the consensus is that such weights hold most of the knowledge” (p. 2).

330.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Konrad Kording: “Here is one non-standard argument for this degree of non-linearity in neurons. Adjusting synapses in helpful ways requires computing how that synapse should adjust based on its contribution to whether the neuron fires. But this computation applies in basically the same way to individual ion channels in the cell: e.g., if the brain can signal to the synapse how to adjust in order to improve neuron firing, it can do the same for ion channels, at no additional cost. This makes Prof. Kording think that the brain is optimizing both. However, current techniques are very bad at measuring ion channel plasticity. Neuroscientists don’t tend to focus on it for this reason. There are considerably more ion channels than synapses, and ion channels change how synapses linearly and nonlinearly interact with one another. This suggests an uglier computational space” (p. 4-5).

331.See p. 494.

332.Sarpeshkar (2010): “Information is always represented by the states of variables in a physical system, whether that system is a sensing, actuating, communicating, controlling, or computing system or a combination of all types. It costs energy to change or to maintain the states of physical variables. These states can be in the voltage of a piezoelectric sensor, in the mechanical displacement of a robot arm, in the current of an antenna, in the chemical concentration of a regulating enzyme in a cell, or in the voltage on a capacitor in a digital processor. Hence, it costs energy to process information, whether that energy is used by enzymes in biology to copy a strand of DNA or in electronics to filter an input. To save energy, one must then reduce the amount of information that one wants to process. The higher the output precision and the higher the temporal bandwidth or speed at which the information needs to be processed, the higher is the rate of energy consumption, i.e., power. To save power, one must then reduce the rate of information processing…The art of low-power design consists of decomposing the task to be solved in an intelligent fashion such that the rate of information processing is reduced as far as is possible without compromising the performance of the system” (p. 9).

333.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Blake Richards (p. 3):

Based on Prof. Richards’s best guess, it seems reasonable to him to budget an order of magnitude of compute for learning, on top of a budget of roughly one FLOP (possibly a bit more) per spike through synapse. However, it could also be higher or lower.

From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Anthony Zador (p. 3):

Prof. Zador expects it to be possible to capture synaptic plasticity with a small number of FLOPs per spike through synapse.

From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Barak Pearlmutter (p. 4):

Overall, Prof. Pearlmutter thought that an estimate based on 100 FLOPs per spike through synapse, with a factor of two for learning, sounded fairly reasonable.

From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Adam Marblestone (p. 9):

Dr. Marblestone expects that both three-factor rules and backpropagation-type methods would imply compute burdens within an order of magnitude or two of estimates based on 1 FLOP per spike through synapse…Dr. Marblestone is fairly comfortable with one FLOP per spike through synapse as a low-end estimate, and ~100 FLOPs per spike through synapse (roughly comparable to the estimate offered by Prof. Rahul Sarpeshkar) as a high-end estimate. His best guess is 10-100 FLOPs per spike through synapse.

From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Chris Eliasmith (p. 5):

In the large scale brain simulations that Chris Eliasmith builds, he often uses an error-driven Hebbian rule, which computes updates to synaptic weights based on pre-synaptic activity, post-synaptic activity, and an error signal (which, in the brain, could proceed via a mechanism like dopamine modulation). This rule requires on the order of three to five operations per synapse (a couple of products, and then a weight update), though the total burden depends on how often you perform the updates…Prof. Eliasmith thinks that neuron models at roughly the level of detail he uses in SPAUN (possibly including some non-linearities in the dendrites), if scaled up to the size of the brain as a whole, would be able not just to replicate cognitive performance, but also to reflect a functional profile similar to biological neurons.
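
To illustrate the kind of per-synapse cost Prof. Eliasmith describes (my own reconstruction of a generic error-driven Hebbian update, not his actual code), each weight change is a product of pre-synaptic activity, a post-synaptic term, and an error signal, plus the update itself:

```python
# Generic three-factor, error-driven Hebbian update (illustrative sketch):
# delta_W[i, j] = lr * error[i] * post_gain[i] * pre[j], then W += delta_W.
# That is a few multiplies and one add per synapse, consistent with the
# "three to five operations per synapse" figure quoted above.
import numpy as np

def error_driven_hebbian_update(W, pre, post_gain, error, lr=1e-3):
    delta = lr * np.outer(error * post_gain, pre)  # ~2-3 multiplies per synapse
    return W + delta                               # one add per synapse
```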

334.Sarpeshkar (2010): “If we assume that synaptic multiplication is at least one floating-point operation (FLOP), the 20 ms second-order filter impulse response due to each synapse is 40 FLOPS, and that synaptic learning requires at least 10 FLOPS per spike, a synapse implements at least 50 FLOPS of computation per spike” (p. 748-749).
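
Spelling out the arithmetic behind Sarpeshkar’s figure:

$$
\underbrace{1}_{\text{synaptic multiply}} \;+\; \underbrace{40}_{\text{20 ms second-order filter}} \;+\; \underbrace{10}_{\text{learning}} \;=\; 51 \;\gtrsim\; 50\ \text{FLOPs per spike through synapse.}
$$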

335.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Eric Jonas: “Prof. Jonas is not convinced by any arguments he’s heard that attempt to limit the amount of state you can store in a neuron. Indeed, some recent work explores the possibility that some information is stored using DNA. If there are actually molecular-level storage mechanisms at work in these systems, that would alter compute estimates by multiple orders of magnitude. … Prof. Jonas thinks that estimating the complexity of learning in the brain involves even more uncertainty than estimates based on firing decisions in neurons. Neuroscientists have been studying things like spike timing dependent plasticity and long-term plasticity for decades, and we can elicit versions of them reliably in vitro. But it’s much harder to understand the actual biological processes occurring in vivo in a behaving animal, because we have so much less experimental access. The machine learning community has multiple theories of the computational complexity of learning. However, these don’t seem to capture the interesting properties of natural systems or existing machine learning systems. … He also has a long-term prior that researchers are too quick to believe that the brain is doing whatever is currently popular in machine learning, and he doesn’t think we’ve found the right paradigm yet” (p. 3-4). One other expert I spoke with was also skeptical/agnostic, though I didn’t do notes from this conversation.

336.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Konrad Kording: “Here is one non-standard argument for this degree of non-linearity in neurons. Adjusting synapses in helpful ways requires computing how that synapse should adjust based on its contribution to whether the neuron fires. But this computation applies in basically the same way to individual ion channels in the cell: e.g., if the brain can signal to the synapse how to adjust in order to improve neuron firing, it can do the same for ion channels, at no additional cost. This makes Prof. Kording think that the brain is optimizing both. However, current techniques are very bad at measuring ion channel plasticity. Neuroscientists don’t tend to focus on it for this reason. There are considerably more ion channels than synapses, and ion channels change how synapses linearly and nonlinearly interact with one another. This suggests an uglier computational space” (p. 4-5).

337.Dr. Dario Amodei emphasized this distinction.

338.A number of experts we engaged with indicated that many computational neuroscientists would not emphasize these other mechanisms very much (though their comments in this respect are not publicly documented); and the experts I interviewed didn’t tend to emphasize such mechanisms either.

339.For example, Dr. Adam Marblestone noted that his own implicit ontology distinguishes between “fast, real-time computation” – the rough equivalent of “standard neuron signaling” in the categorization I’ve been using – and other processes (see Open Philanthropy’s non-verbatim notes from a conversation with Dr. Adam Marblestone (p. 2)). And Prof. Anthony Zador suggested that processes that proceed on longer timescales won’t add much computational burden (see Open Philanthropy’s non-verbatim notes from a conversation with Prof. Anthony Zador (p. 4)).

340.From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Adam Marblestone: “It’s also hard to rule out the possibility that even though relevant processes (e.g., neuropeptide signaling) are proceeding on slow timescales, there are so many of them, implicating sufficiently many possible states and sufficiently complex interactions, that a lot of compute is required regardless” (p. 3).

341.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Eve Marder: “Both experimentalists and theorists sometimes act as though there’s a mechanistic wall between short-term, middle-term, and long-term changes in neural systems. This is partly because you have to come up with experiments that will occur over a given timeframe (two hours, two days, two weeks). But that doesn’t mean the time constants of these processes are two hours, two days, two weeks, etc.: it’s just that you designed an experimental protocol that allows you to see the difference between these periods of time. Historically, limitations on computational resources have also played a role in popularizing such separations. In the old days, people were limited by how much they could compute by the timesteps and integrators they were using, so there was tremendous pressure to separate timescales: no one wants to integrate over very long times at the rates you’d need to in order to capture fast dynamics. Thus, for example, people will take a model with eight or ten currents, and try to reduce it by separating timescales. If you’re clever, you can retain various essential features, but it’s hard to know if you’ve got them all. Whether or not such separations between timescales are biologically reasonable, though, they were computationally necessary, and they have resulted in ingrained beliefs in the field. In reality, the nervous system has an incredible ability to move seamlessly between timescales ranging from milliseconds to years, and the relevant processes interact. That is, short time-scale processes influence long time-scale processes, and vice versa. And unlike digital computers, the brain integrates over very long timescales at very fast speeds easily and seamlessly” (p. 2-3). In an ordinary differential equation model, variables that update more slowly might impose comparable FLOP/s costs to faster variables.
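
The final point can be made concrete with a toy integrator (my own sketch, not from the conversation): in a fixed-timestep ODE model, a slowly varying state variable is updated on every step, just like a fast one, so slowness does not automatically translate into a smaller FLOP/s budget.

```python
# Illustrative fixed-timestep Euler integrator: the "slow" variable (time
# constant of a minute) costs exactly as many FLOPs per step as the "fast"
# one (time constant of a millisecond), because both update every dt.

def euler_step(v_fast, v_slow, dt=1e-4):
    tau_fast, tau_slow = 1e-3, 60.0        # seconds
    v_fast += dt * (-v_fast / tau_fast)    # updated every step
    v_slow += dt * (-v_slow / tau_slow)    # also updated every step, same cost
    return v_fast, v_slow
```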

342.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Anthony Zador: “while global signals may be very important to a model’s function, they won’t add much computational burden (the same goes for processes that proceed on longer timescales). It takes fewer bits to specify a global signal, almost by definition” (p. 4). From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Barak Pearlmutter: “He also suggested that ephaptic effects would be ‘in the noise’ because they are bulk effects, representation of which would involve one number that covers thousands of synapses” (p. 3).

343.Leng and Ludwig (2008): “Classical neurotransmitters are released from axon terminals by Ca2+-dependent exocytosis (Burgoyne and Morgan (2003)); they are packaged in small synaptic vesicles which are preferentially localized at synapses, although recent evidence indicates that extrasynaptic vesicular release can also occur from the somato/dendritic regions of neurones (Cheramy et al. (1981); Huang and Neher (1996); Zilberter et al. (2005)). Peptides are also released by Ca2+-dependent exocytosis, but they are packaged in large dense-core vesicles which generally are not localized to synapses; some are found at synapses, but these vesicles tend to be distributed in soma, dendrites and in axonal varicosities as well as at nerve endings” (p. 5625). See also Mains and Eipper (1999). Russo (2017): “All neuropeptides act as signal transducers via cell-surface receptors. Nearly all neuropeptides act at G-protein coupled receptors (Figure 2). This is an important distinction from ion channel-coupled receptors, since G-protein coupled signaling is consistent with neuropeptides inducing a slower and modulatory response compared to neurotransmitters. In addition, neuropeptide receptors have relatively high ligand affinities (nanomolar Kds), compared to neurotransmitter receptors. This allows a small amount of diffused peptide to still activate receptors. In summary, the combination of these features allows neuropeptides to be active at relatively large distances at relatively low concentrations” (p. 5). My impression is that neuropeptides can also diffuse through the blood (see Mains and Eipper (1999): “Probably the first neuropeptide to be identified was vasopressin, a nine-amino-acid peptide secreted by the nerve endings in the neural lobe of the pituitary. The source of the vasopressin is the magnocellular neurons of the hypothalamus, which send axons to the neurohypophysis, which is the site of release into the blood, in classic neurosecretory fashion”).

344.See Siegelbaum et al. (2013b), (p. 248), and Alger (2002).

345.Burrows (1996): “A neuromodulator is a messenger released from a neuron in the central nervous system, or in the periphery, that affects groups of neurons, or effector cells that have the appropriate receptors. It may not be released at synaptic sites, often acts through second messengers and can produce long-lasting effects. The release may be local so that only nearby neurons or effectors are influenced, or may be more widespread, which means that the distinction with a neurohormone can become very blurred. The act of neuromodulation, unlike that of neurotransmission, does not necessarily carry excitation or inhibition from one neuron to another, but instead alters either the cellular or synaptic properties of certain neurons so that neurotransmission between them is changed” (p. 195).

346.See e.g. Smith et al. (2019): “Our analysis exposes transcriptomic evidence for dozens of molecularly distinct neuropeptidergic modulatory networks that directly interconnect all cortical neurons.”

347.Koch (1999): “It is difficult to overemphasize the importance of modulatory effects involving complex intracellular biochemical pathways. The sound of stealthy footsteps at night can set our heart to pound, sweat to be released, and all our senses to be at a maximum level of alertness, all actions that are caused by second messengers. They underlie the difference in sleep-wake behavior, in affective moods, and in arousal, and they mediate the induction of long-term memories” (p. 95).

348.Marder (2012): “Because neuromodulators can transform the intrinsic firing properties of circuit neurons and alter effective synaptic strength, neuromodulatory substances reconfigure neuronal circuits, often massively altering their output… the neuromodulatory environment constructs and specifies the functional circuits that give rise to behavior” (abstract).

349.Smith et al. (2019): “secreted neuropeptides are thought to persist long enough (e.g., minutes) in brain interstitial spaces for diffusion to very-high-affinity NP-GPCRs hundreds of micrometers distant from release sites… Though present information is limited, eventual degradation by interstitial peptidases nonetheless probably restricts diffusion of most neuropeptides to sub-millimeter, local circuit distance scales.”

350.This is a point suggested by Dr. Dario Amodei. See also Siegelbaum et al. (2013b): “whereas the action of ionotropic receptors is fast and brief, metabotropic receptors produce effects that begin slowly and persist for long periods, ranging from hundreds of milliseconds to many minutes” (p. 236). Koch (1999) says something similar, attributing the difference at least in part to the time it takes for a second messenger to diffuse through a cell: “An important distinction between ionotropic and metabotropic receptors is their time scale. While members of the former class act rapidly, terminating within a very small fraction of a second, the speed of the latter class is limited by diffusion. Biochemical reactions can happen nearly instantaneously at the neuronal time scale. However, if a synaptic input to a metabotropic receptor induces the release of some messenger, such as calcium ions, which have to diffuse to the cell body in order to ‘do their thing,’ the time scale is extended to seconds or longer” (p. 95). Russo (2017): “All neuropeptides act as signal transducers via cell-surface receptors. Nearly all neuropeptides act at G-protein coupled receptors (Figure 2). This is an important distinction from ion channel-coupled receptors, since G-protein coupled signaling is consistent with neuropeptides inducing a slower and modulatory response compared to neurotransmitters” (p. 5).

351.See the abstract.

352.Leng and Ludwig (2008): “These arguments suggest that, in the neural lobe, exocytosis of a large dense-core vesicle is a surprisingly rare event; at any given nerve terminal, it may take about 400 spikes to release a single vesicle. As these endings contain far more vesicles than are found at any synapse, synaptic release of peptides generally in the CNS seems likely to occur with a much lower probability of release. Release of oxytocin within the brain from the dendrites of magnocellular neurones is also infrequent, likely to occur at rates of only about 1 vesicle per cell every few seconds. This seems incompatible with the notion of peptides being effective and faithful mediators of information flow at short time scales and with spatial precision…There is clearly a massive qualitative discrepancy between the rates of release of synaptic vesicles and of peptide-containing vesicles … release of a peptide-containing vesicle is a comparatively rare event for any neurone” (p. 5629-5630).

353.Leng and Ludwig (2008): “Peptide-containing vesicles may contain more than 10 times as much cargo (in terms of the number of messenger molecules)…There are no known reuptake mechanisms for the peptides and the vesicles cannot be re-used. Thus release of a peptide-containing vesicle is a comparatively rare event for any neurone, but one with potentially widespread and profound consequences (cf. volume transmission Fuxe et al. 2007)” (p. 5630).

354.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Anthony Zador: “Prof. Zador believes that neuromodulation is the dominant form of global signaling in the brain. However, while global signals may be very important to a model’s function, they won’t add much computational burden (the same goes for processes that proceed on longer timescales). It takes fewer bits to specify a global signal, almost by definition” (p. 4). Dr. Dario Amodei also took the slow timescales of such signals as evidence that they would not introduce substantially additional FLOP/s. See also Moravec (1988), who writes that “broadcast chemical messages are slow and contain only a relatively small amount of information. In a program their effect can probably be mimicked by a modest number of global variables that are referenced by other computations” (p. 163).

355.Araque and Navarrete (2010): “The nervous system is formed by two major cell types, neurons and glial cells. Glial cells are subdivided into different types with different functions: oligodendroglia, microglia, ependimoglia and astroglia… Glial cells, and particularly astrocytes—the most abundant glial cell type in the central nervous system—were considered to play simple supportive roles for neurons, probably because they lack long processes connecting sensory and effector organs” (p. 2375). Bullock et al. (2005): “Astrocytes are now known to communicate among themselves by means of glial transmitters and neuromodulators as well as by gap junctions (18). Moreover, astrocytes can detect neurotransmitters that are released from neuronal chemical synapses (21). These transmitters are delivered via synaptic vesicles into the synaptic cleft and diffuse to perisynaptic astrocytes. Additionally, neurotransmitters can be released outside the synapse and detected by perisynaptic glia (22, 23). In response, astrocytes can regulate communication between neurons by modifying synaptic transmission through the release of neurotransmitters and neuromodulators (18). Thus, there may be a parallel system of information processing that interacts with neuronal communication but propagates over much slower time scales through a functionally reticular network of non-neuronal cells” (p. 792). Sandberg and Bostrom (2008): “Glia cells have traditionally been regarded as merely supporting actors to the neurons, but recent results suggest that they may play a fairly active role in neural activity” (p. 36).

356.See abstract.

357.Min et al. (2012): “astrocytes can sense a wide variety of neurotransmitters and signaling molecules, and respond with increased Ca2+ signaling” (p. 3). More detail: “when stimulated with specific metabotropic receptor agonists, astrocytes display prominent and extremely slow (up to 10s of seconds) whole-cell Ca2+ responses… astrocytes can modulate neurons by releasing transmitters themselves. These so-called gliotransmitters are very diverse, including conventional transmitters like GABA and glutamate, as well as signaling molecules like purines, D-serine, taurine, cytokines, peptides, and metabolites like lactate (Volterra and Meldolesi (2005)). Astrocytes can release transmitters through two mechanisms. Firstly, they can release transmitter containing vesicles through SNARE mediated exocytosis. Astrocytes contain the necessary proteins for SNARE mediated exocytosis (Araque et al. (2000); Bezzi et al. (2004); Parpura and Zorec (2010); Schubert et al. (2011)), and genetic or pharmacological interference with proteins of the SNARE-complex in astrocytes inhibits numerous forms of astrocyte-neuron signaling (Pascual et al. (2005); Jourdain et al. (2007); Halassa et al. (2009); Henneberger et al. (2010); Min and Nevian (2012)). Secondly, transmitter can be released through reverse transport (Héja et al. (2009)), or through membrane channels (Kozlov et al. (2006); Lee et al. (2010))…” (p. 2-3). See Porter and McCarthy (1997) for more discussion of astrocyte receptors.

358.Min et al. (2012): “When stimulated with specific metabotropic receptor agonists, astrocytes display prominent and extremely slow (up to 10s of seconds) whole-cell Ca2+ responses. This is also true for in vivo experiments, where sensory stimulation reliably induces astroglial slow Ca2+ transients (Wang et al. (2006)) sometimes related to vascular responses (Petzold et al., 2008). The recorded Ca2+ signal can remain restricted to a single or few astrocytes responding to specific sensory stimuli (Wang et al. (2006); Schummers et al. (2008)). Additionally, since astrocytes form complex networks through gap-junctional coupling with neighboring astrocytes (for review see Giaume (2010); Giaume et al. (2010)) Ca2+ signals can spread like a wave through the astrocyte network (Nimmerjahn et al. (2009); Kuga et al. (2011)). Although the mechanisms underlying the propagation of such Ca2+ waves are not fully understood, transport of either IP3 or Ca2+ itself through gap-junctions may play an important role (Venance et al. (1997)). Furthermore, regenerative activity through astrocytic release of signaling molecules like ATP, which in turn activate Ca2+ signals in neighboring astrocytes, can be involved in Ca2+ wave propagation (Guthrie et al. (1999))” (p. 2).

359.Kirischuk et al. (2012): “In addition to generally acknowledged Ca2+ excitability of astroglia, recent studies have demonstrated that neuronal activity triggers transient increases in the cytosolic Na+ concentration ([Na+]i) in perisynaptic astrocytes. These [Na+]i transients are controlled by multiple Na+-permeable channels and Na+-dependent transporters; spatiotemporally organized [Na+]i dynamics in turn regulate diverse astroglial homeostatic responses such as metabolic/signaling utilization of lactate and glutamate, transmembrane transport of neurotransmitters and K+ buffering. In particular, near-membrane [Na+]i transients determine the rate and the direction of the transmembrane transport of GABA and Ca2+” (abstract). Bernardinell et al. (2004): “Glutamate-evoked Na+ increase in astrocytes has been identified as a signal coupling synaptic activity to glucose consumption. Astrocytes participate in multicellular signaling by transmitting intercellular Ca2+ waves. Here we show that intercellular Na+ waves are also evoked by activation of single cultured cortical mouse astrocytes in parallel with Ca2+ waves; however, there are spatial and temporal differences. Indeed, maneuvers that inhibit Ca2+ waves also inhibit Na+ waves; however, inhibition of the Na+/glutamate cotransporters or enzymatic degradation of extracellular glutamate selectively inhibit the Na+ wave. Thus, glutamate released by a Ca2+ wave-dependent mechanism is taken up by the Na+/glutamate cotransporters, resulting in a regenerative propagation of cytosolic Na+ increases. The Na+ wave gives rise to a spatially correlated increase in glucose uptake, which is prevented by glutamate transporter inhibition. Therefore, astrocytes appear to function as a network for concerted neurometabolic coupling through the generation of intercellular Na+ and metabolic waves” (abstract).

360.Min et al. (2012): “astrocytes can sense a wide variety of neurotransmitters and signaling molecules, and respond with increased Ca2+ signaling. But how do astrocytes signal back to neurons? Broadly speaking, astrocytes can do this through three separate mechanisms. Firstly, because astrocytes are crucial for ion homeostasis, they can influence neurons by dynamically altering the ionic balance. Secondly, astrocytes can alter neuronal functioning by modulating the uptake of neurotransmitter molecules from the extracellular space (Theodosis et al. (2008)). Thirdly, astrocytes can release transmitters themselves (Araque et al. (2001))” (p. 3).

361.Min et al. (2012): “Several studies have shown that astrocytes can regulate neuronal excitability. Astrocytes can achieve this through several mechanisms: by regulation of the extracellular ionic composition, by maintaining a tonic extracellular transmitter concentration, by regulation of basal synaptic transmission, and by the induction of phasic events in neighboring neurons” (p. 4). Min et al. (2012): “In addition to modulating neuronal excitability and basal synaptic transmission, astrocytes play a role in the specific strengthening or weakening of synaptic connections, either transiently (short-term plasticity), or long-lasting (long-term plasticity)” (p. 5). See p. 5-9 for more details on astrocyte involvement in short-term and long-term plasticity. Baldwin and Eroglu (2017): “astrocytes are key players in circuit formation, instructing the formation of synapses between distinct classes of neurons” (p. 1).

362.Oberheim et al. (2006): “Human protoplasmic astrocytes manifest a threefold larger diameter and have tenfold more primary processes than those of rodents” (p. 547). On these grounds, Oberheim et al. (2006) propose that the human brain’s astrocytes may play a role in explaining its unique computational power: “By integrating the activity of a larger contiguous set of synapses, the astrocytic domain might extend the processing power of human brain beyond that of other species” (p. 552).

363.Sakry et al. (2014): “Oligodendrocyte precursor cells (OPC) characteristically express the transmembrane proteoglycan nerve-glia antigen 2 (NG2) and are unique glial cells receiving synaptic input from neurons. The development of NG2+ OPC into myelinating oligodendrocytes has been well studied, yet the retention of a large population of synapse-bearing OPC in the adult brain poses the question as to additional functional roles of OPC in the neuronal network. Here we report that activity-dependent processing of NG2 by OPC-expressed secretases functionally regulates the neuronal network” (p. 1). Káradóttir et al. (2008): “We show here that there are two distinct types of morphologically identical oligodendrocyte precursor glial cells (OPCs) in situ in rat CNS white matter. One type expresses voltage-gated sodium and potassium channels, generates action potentials when depolarized and senses its environment by receiving excitatory and inhibitory synaptic input from axons” (p. 1).

364.Bullock et al. (2005): “Myelinating glia do not fire action potentials, but they can detect impulses in axons through membrane receptors that bind signaling molecules. These include ATP (16) and adenosine (17) that are released along the axon and also potassium that is released during intense neural activity” (p. 792). de Faria, Jr. et al. (2019): “Alternatively, active axons can also signal OPCs [oligodendrocyte precursor cells] via non‐synaptic vascular release of growth factors [e.g. platelet‐derived growth factor (PDGF) AA and neurotrophins] and neurotransmitters (e.g. glutamate, GABA or ATP). OPCs express not only ion channels including glutamate‐activated ion channels, the sodium and potassium channels, but also receptors of growth factors. These cellular properties make OPCs equipped to respond to neuronal activity” (p. 450).

365.Stobart et al. (2018b): “We identified calcium responses in both astrocyte processes and endfeet that rapidly followed neuronal events (∼120 ms after). These fast astrocyte responses were largely independent of IP3R2-mediated signaling and known neuromodulator activity (acetylcholine, serotonin, and norepinephrine), suggesting that they are evoked by local synaptic activity. The existence of such rapid signals implies that astrocytes are fast enough to play a role in synaptic modulation and neurovascular coupling” (p. 726). See also Agarwal et al. (2017); Bindocci et al. (2017); Lind et al. (2018); Otsu et al. (2015); Srinivasan et al. (2015); Stobart et al. (2018a). Winship et al. (2007): “These in vivo findings suggest that astrocytes can respond to sensory activity in a selective manner and process information on a subsecond time scale, enabling them to potentially form an active partnership with neurons for rapid regulation of microvascular tone and neuron–astrocyte network properties” (p. 6268). Min et al. (2012): “Two parallel studies have indeed identified small and relatively fast Ca2+ signals that are restricted to the astrocyte process (Di Castro et al. (2011); Panatier et al. (2011)). Two main classes of local calcium events have been identified: focal highly confined transients (about 4 μm) and more robust regional events (about 12 μm; Figure 1; Di Castro et al. (2011)). The more local events have been proposed to be generated by spontaneous single vesicle release at individual synapses whereas the expanded events seem to be generated by single action potentials activating several neighboring synapses in the astrocyte domain” (p. 2-3).

366.Panatier et al. (2011): “we show that astrocytes in the hippocampal CA1 region detect synaptic activity induced by single-synaptic stimulation… single pulse stimulation of neuronal presynaptic elements evoked local Ca2+ events in an astrocytic process” (p. 785, p. 787).

367.Wang et al. (2009): “Astrocytes are electrically non-excitable cells that, on a slow time scale of seconds, integrate synaptic transmission by dynamic increases in cytosolic Ca2+.” Panatier et al. (2011): “the detection and modulation mechanisms in astrocytes are deemed too slow to be involved in local modulation of rapid, basal synaptic transmission. Indeed, although Ca2+ activities have been reported in glial processes (Nett et al. (2002), Perea and Araque (2005), Santello et al. (2011), Wang et al. (2006)), Ca2+ signaling has been generally studied globally in the whole astrocyte, where the slow timescale of Ca2+ changes precludes any spatial and temporal match with fast and localized synaptic transmission. Moreover, trains of sustained stimulation of afferents were necessary to induce this type of glial Ca2+ activity” (p. 785).

368.Min et al. (2012): “The temporal characteristics of astrocytic Ca2+ transients have led to the idea that unlike neurons, astrocytes display exclusively particularly slow responses, and that their signals are not suited to be restricted to small cellular compartments, as happens for example, in dendritic spines” (p. 2).

369.von Bartheld et al. (2016): “The recently validated isotropic fractionator demonstrates a glia:neuron ratio of less than 1:1 and a total number of less than 100 billion glial cells in the human brain. A survey of original evidence shows that histological data always supported a 1:1 ratio of glia to neurons in the entire human brain, and a range of 40-130 billion glial cells. We review how the claim of one trillion glial cells originated, was perpetuated, and eventually refuted” (p. 1).

370.von Bartheld et al. (2016): “All three methods: histology, DNA extraction, and the IF method support numbers of about 10–20 billion neurons and at most a 2-fold larger number of glial cells (20–40 billion) in the human cerebral cortical grey matter, thus supporting an average GNR of approximately 1.5. Inclusion of the white matter (that underlies the grey matter of cerebral cortex) increases the GNR to about 3.0” (p. 11).

371.Verkhratsky and Butt, eds. (2013): “The authors tried to calculate the relative numbers of glial cell types, and they found that astrocytes accounted for ~20 per cent, oligodendrocytes for 75 per cent and microglia for 5 per cent of the total glial cell population. The identifying criteria, however, were rather doubtful, since no specific staining was employed… In the earlier morphological studies, based on 2D counting, the distribution of glial cell types was found to be: astrocytes 40 per cent, oligodendrocytes 50 per cent and microglia 5-10 per cent (Blinkow and Glezer (1968))” (p. 95-96).

372.Verkhratsky and Butt, eds. (2013): “NG2-glia constitute 8-9 per cent of total cells in white matter and 2-3 per cent of total cells in the gray matter, with an estimated density of 10-140 mm2 in the adult CNS (Nishyama et al., 2009)” (p. 326).

373.This was a point suggested by Dr. Dario Amodei. See also Open Philanthropy’s non-verbatim notes from a conversation with Prof. Konrad Kording: “Glial cells would imply a factor of two in required compute, but we are likely to be so many orders of magnitude wrong already that incorporating glia will not make the difference” (p. 3).

374.Oberheim et al. (2006): “Taking into account the increase in size of protoplasmic astrocytes that accompanies this increased synaptic density, we can estimate that each astrocyte supports and modulates the function of roughly two million synapses” (p. 549). Verkhratsky and Butt, eds. (2013): “A single protoplastmic astrocyte in rodent cortex contacts 4-8 neurones, surrounds ~300-600 neuronal dendrites and provides cover for up to 20,000-120,000 synapses residing within its domain (Bushong et al. (2002); Halassa et al. (2007b))… Human protoplasmic astrocytes are 2-3 times larger and exceedingly more complex; the processes of a single human protoplasmic astrocyte cover approximately 2 million synapses” (p. 114). Winship et al. (2007): “It is worth noting that astrocyte processes can contact up to 100,000 synapses (Bushong et al. (2002))” (p. 6271).

375.Their methodology assumes that “the same type of neuron or non-neuronal cells is assumed to approximately have a similar energy expenditure no matter where they [are] located (in GM or WM)” (p. 14). Given roughly equal numbers of neurons and non-neuronal cells in the brain as a whole (see Azevedo et al. (2009), p. 536), this would naively suggest that neurons account for roughly 97% of the brain’s overall energy consumption. However, I’m not sure that such a naive application of their estimate is appropriate.

376.This is a point made by AI Impacts, who also add that “although we can imagine many possible designs on which glia would perform most of the information transfer in the brain while neurons provided particular kinds of special-purpose communication at great expense, this does not seem likely given our current understanding.”

377.“FIG. 3. (A) Distribution of signaling-related ATP usage among different cellular mechanisms when the mean firing rate of neurons is 4 Hz. The percentages of the expenditure maintaining resting potentials, propagating action potentials through a neuron, and driving presynaptic Ca2+ entry, glutamate recycling, and postsynaptic ion fluxes, are shown (100% = 3.29 × 10⁹ ATP/neuron/s). (B) Comparison of our predicted distribution of signaling-related energy consumption with the distribution of mitochondria observed by Wong-Riley (1989). For the dendrites + soma column, Wong-Riley’s data are the percentage of mitochondria in dendrites, whereas our prediction is the percentage of energy expended on postsynaptic currents, dendritic and somatic action potentials, and the neuronal resting potential. For the axons + terminals column, Wong-Riley’s data are the percentage of mitochondria in axons and presynaptic terminals, and our prediction is for the percentage of energy expended on axonal action potentials, presynaptic Ca2+ entry, accumulating glutamate into vesicles, and recycling vesicles. The close spacing of terminals along axons (5 µm, implying a diffusion time of only 25 milliseconds (Braitenberg and Schüz (1998))) will make terminal and axonal mitochondria functionally indistinguishable. For the glia column, Wong-Riley’s data are the percentage of mitochondria in glia, whereas our prediction is for the energy expended on the glial resting potential, glutamate uptake, and its conversion to glutamine. This comparison ignores the 25% of energy expenditure not related to signaling (see Discussion), and the possibility that some processes (for example, in glia) may be driven mainly by glycolysis” (p. 1140).

378.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Anthony Zador: “Glia are very important to understanding disease, but Prof. Zador does not believe that they are important to computing in the brain” (p. 4).

379.See Siegelbaum and Koester (2013d), (p. 178).

380.See Siegelbaum and Koester (2013d), (p. 178).

381.See Siegelbaum and Koester (2013d), (p. 178).

382.Siegelbaum and Koester (2013d): “Most synapses in the brain are chemical” (p. 177). Lodish et al. (2000): “We also briefly discuss electric synapses, which are much rarer, but simpler in function, than chemical synapses.” Purves et al. (2001): “Although they are a distinct minority, electrical synapses are found in all nervous systems, including the human brain.” Wang et al. (2010) suggest probabilities of 0.5% and 1.4% of coupling between pyramidal cells in different brain regions. From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Chris Eliasmith: “Adding gap junctions probably would not substantially increase the overall compute budget, because they are not very common” (p. 4). From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Barak Pearlmutter: “Prof. Pearlmutter characterized the comparatively minimal number of gap junctions as the ‘bottom line’ with respect to their computational role” (p. 3).

383.Siegelbaum and Koester (2013d): “Electrical synapses are employed primarily to send rapid and stereotyped depolarizing signals. In contrast, chemical synapses are capable of more variable signaling and thus can produce more complex behaviors. They can mediate either excitatory or inhibitory actions in postsynaptic cells and produce electrical changes in the postsynaptic cell that last from milliseconds to many minutes. Chemical synapses also serve to amplify neuronal signals, so even a small presynaptic nerve terminal can alter the response of large postsynaptic cells. Not surprisingly, most synapses in the brain are chemical” (p. 177). Bullock et al. (2005) also suggest that “electrical transmission through gap junctions was initially considered primitive and likely incapable of the subtleties of chemical transmission through axon-dendrite synapses” (p. 792). From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Jess Riedel: “From a computational perspective, electrical synapses lack gain – the ability to amplify signals. Dr. Riedel recalls that gain is a key property of computational units like transistors” (p. 5).

384.From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Adam Marblestone: “Sometimes the coupling between neurons created by gap junctions is so fast that they are treated as one neuron for modeling purposes. Gap junctions are also often thought of as supporting some kind of oscillation or globally coherent behavior that might not require a lot of computation. Whether gap junctions could create more computationally-expensive, non-linear interactions between different parts of neurons is an interesting question” (p. 6). Bennett and Zukin (2004): “Gap junctions can synchronize electrical activity and may subserve metabolic coupling and chemical communication as well. They are thought to play an important role in brain development, morphogenesis, and pattern formation (Bennett et al. (1991), Bruzzone et al. (1996), Dermietzel et al. (1989), Goodenough et al. (1996))” (p. 495).

385.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Barak Pearlmutter: “[Prof. Pearlmutter] took the fact that gap junctions are roughly linear, and that they don’t involve time delays, as evidence they would be easy to model” (p. 3). Though Bullock et al. (2005) seem to suggest some forms of complex behavior: “an electrical impulse in one cell by no means inevitably propagates to the other cells with which it shares gap junctions. In fact, a channel within a gap junction is not necessarily open, and an entire gap junction may not transmit electrical current until it is appropriately modified in response to transmission from chemical synapses of the same, ‘presynaptic’ neuron” (p. 792).

386.Trenholm et al. (2013): “We identified a network of electrically coupled motion–coding neurons in mouse retina that act collectively to register the leading edges of moving objects at a nearly constant spatial location, regardless of their velocity” (abstract).

387.From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Stephen Larson: “Dr. Larson thinks that gap junctions can contribute to non-linear dynamics and near-chaotic dynamics within neural networks. As a rough rule of thumb: the more non-linear a system is, the more computationally expensive it is to simulate” (p. 3).

388.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Chris Eliasmith: “You can model a gap junction as a connection that updates every timestep, rather than every time a spike occurs” (p. 4).

389.They show that a wave of periodic neural activity can propagate across two physically separated pieces of hippocampal tissue (separation that removes the possibility of chemical or electrical synaptic communication), and that this propagation was blocked by a mechanism that cancels the relevant electrical field – results that strongly suggest ephaptic effects as a causal mechanism. Chiang et al. (2019): “To confirm the absence of any role of synaptic transmission and to eliminate other forms of communication between neurons except for ephaptic coupling, we next examined the possibility that electric fields generated by pyramidal neurons could propagate through a cut in the tissue by activating other cells across a small gap of the tissue, thereby eliminating chemical, electrical synapses (gap junctions), or axonal transmission. Fig. 4A and B shows the propagation of the slow hippocampal periodic activity before and after the cut in the tissue. To ensure that the slice was completely cut, the two pieces of tissue were separated and then rejoined while a clear gap was observed under the surgical microscope. The slow hippocampal periodic activity could indeed generate an event on the other side of a complete cut through the whole slice (Fig. 4B). However, the slow hippocampal periodic activity failed to trigger the activity across the gap when the distance of the gap increased (Fig. 4C). The expanded window in Fig. 4D shows that the waveforms of the slow hippocampal periodic activity and the delay between two signals measured in recording electrodes 1 and 2 were similar. The speed of the slow hippocampal periodic activity across the tissue was not affected by the presence of the cut in Fig. 4E (t test, n = 36 events in 3 slices). Therefore, this experiment shows that slow hippocampal periodic activity can propagate along a cut tissue by activating cells on the other side without any chemical and electrical synaptic connections at a similar speed to those observed in the intact tissue” (p. 255).

390.Anastassiou et al. (2011): “We found that extracellular fields induced ephaptically mediated changes in the somatic membrane potential that were less than 0.5 mV under subthreshold conditions. Despite their small size, these fields could strongly entrain action potentials, particularly for slow (<8 Hz) fluctuations of the extracellular field” (abstract). Chiang et al. (2019): “Ephaptic coupling has been suggested as a mechanism involved in modulating neural activity from different regions of the nervous system (Jefferys (1995); Weiss and Faber (2010); Anastassiou and Koch (2015)) especially in the vertebrate retina (Vroman et al. (2013)) and in the olfactory circuit (Su et al. (2012)). Several studies also indicate that weak electric fields can influence the neural activity at the cortical and hippocampal network level (Francis et al. (2003); Deans et al. (2007); Fröhlich and McCormick (2010)). In hippocampal slices, weak electric fields can affect the excitability of pyramidal cells and the synchronization of the hippocampal network (Francis et al. (2003); Deans et al. (2007)). In the cortex, weak electric fields have also been shown to modulate slow periodic activity in the in vitro preparation (Fröhlich and McCormick (2010)). Although endogenous electric fields are thought to be too weak to excite neurons, two recent studies suggest that weak electric fields are involved in the propagation of epileptiform activity at a specific speed of 0.1 m s−1 (Zhang et al. (2014); Qiu et al. (2015))” (p. 250).

391.Chiang et al. (2019): “Slow oscillations have been observed to propagate with speeds around 0.1 m s−1 throughout the cerebral cortex in vivo… The mechanism most consistent with the data is ephaptic coupling whereby a group of neurons generates an electric field capable of activating the neighbouring neurons” (p. 250).

392.Anastassiou and Koch (2015): “The biggest question about ephaptic coupling to endogenous fields remains its functional role: does such nonsynaptic, electric communication contribute to neural function and computations in the healthy brain (e.g., in the absence of the strong fields generated during epileptic seizures or other pathological brain states)? And, if yes, where, how and under which conditions? While characterizing ephaptic effects at the level of synapses, neurons and circuits in slice remains invaluable, ephaptic coupling must ultimately be studied in behaving animals. This is particularly so as such effects are likely to be small (e.g., compared to spike threshold) and spatially diffuse (in the case of LFPs), suggesting a circuit-wide feedback mechanism, that is, at the level where neural processing relevant to behavior occurs [62]” (see “Outlook”).

393.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Anthony Zador: “Prof. Zador believes that ephaptic communication is very unlikely to be important to the brain’s information-processing” (p. 4).

394.Resting membrane potential is typically around -70 mV, and the threshold for firing is around -55 mV, though these vary somewhat. Anastassiou and Koch (2015): “such effects are likely to be small (e.g., compared to spike threshold)” (see “Outlook”).

395.Anastassiou and Koch (2015): “The usefulness of such studies for understanding ephaptic coupling to endogenous fields is limited–chiefly, the cases emulated in slice oversimplify in vivo activity where neurons are continuously bombarded by hundreds of postsynaptic currents along their intricate morphology in the presence of a spatially inhomogeneous and temporally dynamic electric field (Figure 1c; compare to fields in Figure 1a,b). Such limitations are present both for fields induced across parallel plates positioned millimeters away from each other (e.g., [24, 25, 30]) as well as fields elicited via stimulation pipettes (e.g., [1, 28]). To account for the impact of endogenous fields on single neurons, both the intracellular and extracellular voltage would not only need to be monitored along a single cell but also manipulated, and all this in the behaving animal” (see “Neurons (mesoscale)”).

396.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Anthony Zador: “Prof. Zador believes that ephaptic communication is very unlikely to be important to the brain’s information-processing. Even if it was important, though, it would be a form of global signaling, and so comparatively inexpensive to model.” (p. 4). From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Barak Pearlmutter: “He also suggested that ephaptic effects would be ‘in the noise’ because they are bulk effects, representation of which would involve one number that covers thousands of synapses” (p. 3).

397.Sandberg and Bostrom (2008): “If ephaptic effects were important, the emulation would need to take the locally induced electromagnetic fields into account. This would plausibly involve dividing the extracellular space (possibly also the intracellular space) into finite elements where the field can be assumed to be constant, linear or otherwise easily approximable. The cortical extracellular length constant is on order of ≈100 μm (Gardner‐Medwin (1983)), which would necessitate on the order of 1.4×10¹² such compartments if each compartment is 1/10 of the length constant. Each compartment would need at least two vector state variables and 6 components of a conductivity tensor; assuming one byte for each, the total memory requirements would be on the order of 10 terabytes. Compared to estimates of neural simulation complexity, this is relatively manageable. The processing needed to update these compartments would be on the same order as a detailed compartment model of every neuron and glia cell” (p. 36-7).
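
Their numbers can be reconstructed as follows (my own back-of-the-envelope, assuming a brain volume of roughly 1.4 liters and one byte per scalar): compartments of side 10 μm (1/10 of the ~100 μm length constant) give

$$
\frac{1.4 \times 10^{-3}\ \text{m}^3}{(10\ \mu\text{m})^3} = \frac{1.4 \times 10^{-3}\ \text{m}^3}{10^{-15}\ \text{m}^3} = 1.4 \times 10^{12}\ \text{compartments},
$$

$$
1.4 \times 10^{12} \times \underbrace{(2 \times 3 + 6)\ \text{bytes}}_{\text{two 3-vectors + conductivity tensor}} \approx 1.7 \times 10^{13}\ \text{bytes} \approx 17\ \text{TB},
$$

which is on the order of their ~10 terabyte figure.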

398.Bullock et al. (2005), describing the history of early neuroscience: “physiological studies established that conduction of electrical activity along the neuronal axon involved brief, all-or-nothing, propagated changes in membrane potential called action potentials. It was thus often assumed that neuronal activity was correspondingly all-or-nothing and that action potentials spread over all parts of a neuron. The neuron was regarded as a single functional unit: It either was active and “firing” or was not” (p. 791).

399.Zbili and Debanne (2019): “When it invades the presynaptic terminal, the spike provokes the opening of voltage-gated calcium channels (Cav), leading to an increase of Ca2+concentration in the bouton and the release of neurotransmitters. Due to the power law between intra-terminal Ca2+ concentration and neurotransmitter release, small variations in presynaptic calcium entry, occurring through spike shape modifications, can lead to large changes in synaptic transmission (Sabatini and Regehr (1997); Bollmann et al. (2000); Bischofberger et al. (2002); Fedchyshyn and Wang (2005); Yang and Wang (2006); Bucurenciu et al. (2008); Scott et al. (2008); Neishabouri and Faisal (2014)). In fact, spike broadening during repetitive firing entails synaptic transmission facilitation in the pituitary nerve (Jackson et al. (1991)), dorsal root ganglion (Park and Dunlap (1998)) and mossy fiber bouton (Geiger and Jonas (2000)). Other studies showed that spike amplitude depression during repetitive firing provokes a decrease in synaptic transmission at hippocampal (Brody and Yue (2000); Prakriya and Mennerick (2000); He et al. (2002)) and cerebellar synapses (Kawaguchi and Sakaba (2015))” (p. 2).

400.Zbili and Debanne (2019): “the synaptic strength depends on the subthreshold membrane potential of the presynaptic cell, indicating that the presynaptic spike transmits this analog information to the postsynaptic cell. However, the direction of this modulation of synaptic transmission seems to depend on the type of synapse” (p. 5). Zbili and Debanne (2019), reviewing the literature on effects of this broad type, report increases in neurotransmitter release ranging from 10-100%, depending on the study (p. 7). Shu et al. (2006), for example, caused a 29% median enhancement to the impact of a spike through synapse in ferret pyramidal cells by changing the membrane potential in the soma in a manner that stayed below the threshold for an action potential (abstract).

401.Juusola et al. (1996): “Many neurons use graded membrane-potential changes, instead of action potentials, to transmit information. Traditional synaptic models feature discontinuous transmitter release by presynaptic action potentials, but this is not true for synapses between graded-potential neurons. In addition to graded and continuous transmitter release, they have multiple active zones, ribbon formations and L-type Ca2+ channels. These differences are probably linked to the high rate of vesicle fusion required for continuous transmitter release. Early stages of sensory systems provide some of the best characterized graded-potential neurons, and recent work on these systems suggests that modification of synaptic transmission by adaptation is a powerful feature of graded synapses” (abstract).

402.Graubard et al. (1980): “Graded synaptic transmission occurs between spiking neurons of the lobster stomatogastric ganglion. In addition to eliciting spike-evoked inhibitory potentials in postsynaptic cells, these neurons also release functionally significant amounts of transmitter below the threshold for action potentials. The spikeless postsynaptic potentials grade in amplitude with presynaptic voltage and can be maintained for long periods. Graded synaptic transmission can be modulated by synaptic input to the presynaptic neuron” (p. 3733).

403.Graded synaptic transmission is distinct from the spontaneous release of neurotransmitter associated with what are called “miniature postsynaptic currents.” From Faisal et al. (2008): “The classic manifestation of synaptic noise is the spontaneous miniature postsynaptic current (mPSC) that can be recorded in the absence of presynaptic input. Katz and collaborators interpreted mPSCs as being the result of spontaneously released neurotransmitter vesicles, thus establishing the quantal nature of synaptic transmission” (p. 7).

404.See Dugladze et al. (2012): “We found that during in vitro gamma oscillations, ectopic action potentials are generated at high frequency in the distal axon of pyramidal cells (PCs) but do not invade the soma. At the same time, axo-axonic cells (AACs) discharged at a high rate and tonically inhibited the axon initial segment, which can be instrumental in preventing ectopic action potential back-propagation. We found that activation of a single AAC substantially lowered soma invasion by antidromic action potential in postsynaptic PCs. In contrast, activation of soma-inhibiting basket cells had no significant impact. These results demonstrate that AACs can separate axonal from somatic activity and maintain the functional polarization of cortical PCs during network oscillations” (abstract). See also Sheffield (2011): “In a subset of rodent hippocampal and neocortical interneurons, hundreds of spikes, evoked over minutes, resulted in persistent firing that lasted for a similar duration. Although axonal action potential firing was required to trigger persistent firing, somatic depolarization was not. In paired recordings, persistent firing was not restricted to the stimulated neuron – it could also be produced in the unstimulated cell. Thus, these interneurons can slowly integrate spiking, share the output across a coupled network of axons, and respond with persistent firing even in the absence of input to the soma or dendrites” (abstract).

405.Pre-synaptic hyperpolarization (decreasing the membrane potential) can have effects within 15-50 ms. Zbili and Debanne (2019): “ADFs present various time constants which determine their potential roles in network physiology. In fact, in most of the studies, d-ADF needs 100 ms to several seconds of presynaptic depolarization to occur. On the contrary, h-ADF can be produced by fast presynaptic hyperpolarization (15–50 ms; Rama et al. (2015a)). This difference is well explained by the underlying mechanism of d-ADF and h-ADF: slow accumulation of basal Ca2+ (Bouhours et al. (2011); Christie et al. (2011)) or slow Kv inactivation for d-ADF (Shu et al. (2006), Shu et al. (2007); Kole et al. (2007); Bialowas et al. (2015)), fast recovery from inactivation of Nav for h-ADF (Rama et al. (2015a); Zbili et al. (2016)). Therefore, d-ADF and h-ADF should have different consequences on information transfer in neuronal networks” (p. 8).

406.Sheffield (2011): “In a subset of rodent hippocampal and neocortical interneurons, hundreds of spikes, evoked over minutes, resulted in persistent firing that lasted for a similar duration” (abstract).

407.Zbili and Debanne (2019) report that in most studies, it takes “100 ms to several seconds of presynaptic depolarization” (p. 8).

408.My understanding is that the applicability of this consideration depends on the “length” or “space” constant associated with different axons in the brain: the influence of pre-synaptic membrane potential changes decays exponentially along the axon in the absence of active participation from ion channels. Here’s Backyard Brains on the length/space constant: “let’s talk about the length constant (this is sometimes also called the “space constant”). The length constant (λ, or lambda) is a measure of how far the voltage travels down the axon before it decays to zero. If you have a length constant of 1 mm, that means at 1 mm away from the cell body in an axon, 37% of the voltage magnitude remains. At 2 mm away from the cell body in an axon, 14% of the magnitude remains, and at 3 mm away, 5% remains. This is representative of an ‘exponential decay’ function.” Here’s Zbili and Debanne (2019) on how this applies to analog-digital signaling along the axon: “One of the main issues concerning Analog-Digital Facilitations is the spatial extent of these phenomena along the axon. In fact, ADFs are produced by subthreshold modifications of the somatic potential that spreads to the presynaptic terminal and modifies presynaptic spike shape or basal Ca2+ (Debanne et al. (2013); Rama et al. (2015b)). Therefore, the axonal space constant is a major determinant of the spatial extent of ADF. The axonal space constant varies among neuronal types, depending on the axonal diameter, the density of axonal branching and the axonal membrane resistance (Sasaki et al. (2012)). In CA3 hippocampal neurons, the axonal space constant has been evaluated around 200–500 μm (Sasaki et al. (2012); Bialowas et al. (2015); Rama et al. (2015a)). In L5 pyramidal neurons, the value estimated ranges between 500 μm (Shu et al. (2006); Kole et al. (2007)) and 1,000 μm (Christie and Jahr (2009)). In CA1 pyramidal neurons, the axonal space constant was found to be around 700 μm (Kim (2014)). Therefore, ADFs seem to be restricted to local brain circuits. For example, d-ADF has been found between CA3 neurons but not at the synapses between CA3 and CA1 neurons (Sasaki et al. (2012)). However, several lines of evidence suggest that ADFs could also occur between more distant neurons…” (p. 160).
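
To make the quoted figures concrete: with length constant λ, the fraction of the voltage remaining at distance x is exp(-x/λ). A short check reproduces the 37%, 14%, and 5% figures above:

```python
import math

lam_mm = 1.0  # length constant of 1 mm, as in the example above
for x_mm in (1, 2, 3):
    # fraction of the voltage magnitude remaining at distance x
    print(f"{x_mm} mm: {math.exp(-x_mm / lam_mm):.0%} of the voltage remains")
# 1 mm: 37%, 2 mm: 14%, 3 mm: 5%
```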

409.Moore and Cao (2008): “The standard modern view of blood flow is that it serves a physiological function unrelated to information processing, such as bringing oxygen to active neurons, eliminating “waste” generated by neural activity, or regulating temperature” (p. 2035).

410.See Moore and Cao (2008) (p. 2037-2040).

411.Moore and Cao (2008): “[In] the somatosensory neocortex, blood flow increases measured using laser Doppler have been observed <200 ms after the onset of sensory-evoked neural responses (Matsuura et al. (1999); Norup Nielsen and Lauritzen (2001)). Similarly, optical imaging techniques that integrate over local volumes at somewhat slower temporal resolution typically record a significant increase in flow within ≤500 ms of sensory stimulus presentation (Dunn et al. (2005); Malonek et al. (1997); Martin et al. (2006)). The subsequent duration of these increases is often viewed as “poorly correlated” with neural activity, because functional hyperemia can sustain for seconds after the onset and offset of a stimulus. As discussed in a later section, this sustained temporal pattern may not be a mismatch between activity and flow, but rather may be consistent with the information processing role of blood flow” (p. 2037).

412.From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Stephen Larson: “It’s generally thought that blood flow is more of an epiphenomenon/a sign that other forms of information processing are occurring (akin to the heat generated by a CPU), than a mechanism of information-processing in itself” (p. 4).

413.The exact number, along with the definition of a column, appears to be the subject of some debate (see Rakic (2008) for complaints). Krueger (2008): “In humans, each column contains 1000 to 10,000 cells.”

414.Moore and Cao (2008): “In the somatosensory and visual neocortex, a general consensus exists that the pattern of increased blood flow is similar to that of subthreshold neural activity, with a peak in signal that is localized to a cortical column (400 μm) and an extent spanning several columns (Dunn et al. (2005); Hess et al. (2000); Lauritzen (2001); Sheth et al. (2004); Vanzetta et al. (2004); Yang et al. (1998)) … In other brain areas, evidence for more precise delivery has also been observed, because flow can be localized to a single glomerulus in the olfactory bulb during stimulus presentation (i.e., 100 μm) (Chaigneau et al. (2003); Yang et al. (1998))” (p. 2037).

415.Other possibilities include the perineuronal net (see Tsien (2013) for discussion) and classical dynamics in microtubules (see Cantero et al. (2018)). I leave out these two mechanisms partly because of time constraints, and partly because my impression is that they do not feature very prominently in the discourse on this topic.

416.Though see Open Philanthropy’s non-verbatim notes from a conversation with Prof. Anthony Zador: “Prof. Zador is skeptical that there are major unknown unknowns in the parts list in the brain, given how much effort has gone into studying nervous systems. Biology is complicated, and there is still more to understand, but Prof. Zador does not think that what we are missing is a breakthrough in biology. Rather, what’s missing is an understanding of the brain’s organizing principles” (p. 4).

417.A number of experts we engaged with indicated that many computational neuroscientists would not emphasize other mechanisms very much (though their comments in this respect are not publicly documented); and the experts I interviewed didn’t tend to emphasize such mechanisms either.

418.Technically, this would be ~3e13-3e17 FLOP/s, if we were really adding up synaptic transmission, firing decisions, and learning. But these ranges are sufficiently made-up and arbitrary that this sort of calculation seems to me misleadingly precise.

419.That is, I did not do fully independent analyses of each of these areas and then combine them (this is why the ranges are so similar). Rather, I started with a baseline, default model of 1 FLOP per spike through synapse, and then noted that budgeting 10-100x of cushion on top of that would cover various salient complexities and expert estimates across various of these categories.

420.Funabiki et al. (2011): “In owls, NL neurons change their firing rates with changes in ITD of <10 μs (Carr and Konishi (1990); Peña et al. (1996)), far below the spike duration of the neurons (e.g., ∼1 ms). The data used for modeling these coincidence detection processes have so far come from in vitro studies in the chick’s NL (Reyes et al. (1996); Funabiki et al. (1998); Kuba et al. (2005), (2006); Slee et al. (2010)), extracellular studies of the barn owl’s NL neurons (Carr and Konishi (1990); Peña et al. (1996); Fischer et al. (2008)), and the owl’s behavioral performance (Knudsen et al. (1979)). Specialized cellular mechanisms, including extraordinary fast glutamate receptors (Reyes et al. (1996); Trussell (1999); Kuba et al. (2005)), low threshold-activated potassium conductance (KLVA) (Reyes et al. (1996)), and remote spike initiation (Carr and Boudreau (1993b); Kuba et al. (2006); Ashida et al. (2007)), have been discussed as important elements of this extraordinary precise coincidence detection” (p. 15245).

421.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Eric Jonas: “Active dendritic computation could conceivably imply something like 1-5 orders of magnitude more compute than a simple linear summation model of a neuron. And if dendritic morphology is evolving over time, you also need to be thinking about the space of all possible dendrites that could have formed, in addition to the current dendritic tree” (p. 3). He also added, though, “it’s reasonable to think that at the end of the day, simplified dendritic models are available. For example, Prof. Jonas has heard arguments suggesting that post-synapse, there is very little plasticity in dendrites, and that dendritic computation mostly involves applying random features to inputs” (p. 3).

422.See e.g. Bhalla (2014).

423.Kaplanis et al. (2018): “we show that by equipping tabular and deep reinforcement learning agents with a synaptic model that incorporates this biological complexity (Benna and Fusi (2016)), catastrophic forgetting can be mitigated at multiple timescales. In particular, we find that as well as enabling continual learning across sequential training of two simple tasks, it can also be used to overcome within-task forgetting by reducing the need for an experience replay database” (p. 1). Zenke et al. (2017): “In this study, we introduce intelligent synapses that bring some of this biological complexity into artificial neural networks. Each synapse accumulates task relevant information over time, and exploits this information to rapidly store new memories without forgetting old ones. We evaluate our approach on continual learning of classification tasks, and show that it dramatically reduces forgetting while maintaining computational efficiency” (abstract).

424.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Eve Marder: “In reality, the nervous system has an incredible ability to move seamlessly between timescales ranging from milliseconds to years, and the relevant processes interact. That is, short time-scale processes influence long time-scale processes, and vice versa. And unlike digital computers, the brain integrates over very long timescales at very fast speeds easily and seamlessly” (p. 2).

425.See von Bartheld et al. (2016): “The recently validated isotropic fractionator demonstrates a glia:neuron ratio of less than 1:1… We review how the claim of one trillion glial cells originated, was perpetuated, and eventually refuted.” (p. 1).

426.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Erik De Schutter: “The brain was not engineered. Rather, it evolved, and evolution works by adding complexity, rather than by simplification… Indeed, in general, many scientists who approach the brain from an engineering perspective end up on the wrong footing. Engineering is an appropriate paradigm for building AI systems, but if you want to understand the brain, you need to embrace the fact that it works because it is so complicated. Otherwise, it will be impossible to understand the system” (p. 4).

427.See e.g. Kempes et al. (2017): “Here we show that the computational efficiency of translation, defined as free energy expended per amino acid operation, outperforms the best supercomputers by several orders of magnitude, and is only about an order of magnitude worse than the Landauer bound” (p. 1). Rahul Sarpeshkar, in a 2018 TED talk, suggests that cells are the most energy efficient computers that we know, and that they are already computing at an efficiency near the fundamental laws of physics (3:30-4:04). See also Laughlin et al. (1998): “Freed from heavy mechanical work, ion channels change conformation in roughly 100 μs. In principle, therefore, a single protein molecule, switching at the rate of an ion channel with the stoichiometry of kinesin, could code at least 10^3 bit per second at a cost of 1 ATP per bit” (p. 39). See Sarpeshkar (2013) for more on computation in cells, and Sarpeshkar (2010) for more on the energy-efficiency of biological systems more generally: “A single cell in the body performs ~10 million energy-consuming biochemical operations per second on its noisy molecular inputs with ~1 pW of average power. Every cell implements a ~30,000 node gene-protein molecular interaction network within its confines. All the ~100 trillion cells of the human body consume ~80 W of power at rest. The average energy for an elementary energy-consuming operation in a cell is about 20kT, where kT is a unit of thermal energy. In deep submicron processes today, switching energies are nearly 10^4–10^5 kT for just an elementary 0->1 digital switching operation. Even at 10 nm, the likely end of business-as-usual transistor scaling in the future, it is unlikely that we will be able to match such energy efficiency. Unlike traditional digital computation, biological computation is tolerant to error in elementary devices and signals. Nature illustrates that it is significantly more energy efficient to compute with error-prone devices and signals and then correct for these errors through feedback-and-learning architectures than to make every device and every signal in a system robust, as in traditional digital paradigms thus far” (p. 18-19). Bennett (1989) also suggests that “a few thermodynamically efficient data processing systems do exist, notably genetic enzymes such as RNA polymerase, which, under appropriate reactant concentrations, can transcribe information from DNA to RNA at a thermodynamic cost considerably less than kT per step” (p. 766).

428.See e.g. from Open Philanthropy’s non-verbatim notes from a conversation with Prof. Eric Jonas: “Various discoveries in biology have altered Prof. Jonas’s sense of the complexity of what biological systems can be doing. Examples in this respect include non-coding RNA, the complexity present in the three-dimensional structure of the cell, histone regulatory frameworks, and complex binding events involving different chaperone proteins. The class of computation that Prof. Jonas can imagine a single cell doing now seems multiple orders of magnitude more complex than it did 20 years ago” (p. 4).

429.Sarpeshkar (1998): “Items 1 through 3 show that analog computation can be far more efficient than digital computation because of analog computation’s repertoire of rich primitives. For example, addition of two parallel 8-bit numbers takes one wire in analog circuits (using Kirchoff’s current law), whereas it takes about 240 transistors in static CMOS digital circuits. The latter number is for a cascade of 8 full adders. Similarly an 8-bit multiplication of two currents in analog computation takes 4 to 8 transistors, whereas a parallel 8-bit multiply in digital computation takes approximately 3000 transistors. Although other digital implementations could make the comparisons seem less stark, the point here is simply that exploiting physics to do computation can be powerful” (p. 1605). See also Daniel et al. (2013): “Because analog computation exploits powerful biochemical mathematical basis functions that are naturally present over the entire continuous range of input operation, they are an advantageous alternative to digital logic when resources of device count, space, time or energy are constrained” (p. 619).

430.See e.g. Open Philanthropy’s non-verbatim notes from a conversation with Prof. Eve Marder: “Unlike digital computers, the brain integrates over very long timescales at very fast speeds easily and seamlessly” (p. 3).

431.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Rosa Cao: “Digital computers achieve speed and reliability by ignoring many dimensions of what is happening in the system. In such a context, you only care about whether the voltage in the transistors is above or below a certain threshold, and designers try hard to shield this variable from disruptive physical fluctuations. The brain is built on fairly different principles. Its functional processes are not shielded from the dynamics of the brain’s biochemistry. Rather, the brain exploits this biochemistry to perform efficient computation. This makes the brain difficult to simulate. In nature, biochemical processes like protein-protein interactions just happen, so they are “free” for the brain to run. Simulating them, however, can be quite computationally expensive” (p. 1-2).

432.From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Adam Marblestone: “Neuroscience is extremely limited by available tools. For example, we have the concept of a post-synaptic potential because we can patch-clamp the post-synaptic neuron and see a change in voltage. When we become able to see every individual dendritic spine, we might see that each has a different response; or when we become able to see molecules, we might see faster state transitions, more interesting spatial organization, or more complicated logic at the synapses. We don’t really know, because we haven’t been able to measure. It’s also possible that some theories in neuroscience emerge and persist primarily because (a) they are the type of simple ideas that humans are able to come up with, and (b) these theories explain some amount of data (though it’s unclear how much). It’s hard to formulate complicated ideas about how the brain works that can then be made testable.” (p. 9). From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Erik De Schutter: “with improvements in imaging and cell biology techniques, we discover all sorts of new complexities that we didn’t know were there” (p. 1).

433.Thanks to Luke Muehlhauser for suggesting this possibility.

434.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Eric Jonas: “There is a history of over-optimism about scientific progress in neuroscience and related fields. Prof. Jonas grew up in an era of hype about progress in science (e.g., “all of biology will yield its secrets in the next 20 years”), and has watched the envisioned future fail to arrive. Indeed, many problems have been multiple orders of magnitude more complicated than expected, to such a degree that some people are now arguing that science is slowing down, and must rely increasingly on breadth-first search through possible research paths. In biology, for example, there was a lot of faith that the human genome project would lead to more completeness and understanding than it did” (p. 4-5). From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Rosa Cao: “E. Coli, a comparatively simple, one-celled organism, exhibits fairly sophisticated behavior on the basis of carefully-tuned biochemical chains (for example, various rhythms at different timescales that allow the cell to survive in a range of environments). We have not yet been successfully able to capture this behavior in a computational model, despite throwing a lot of effort and computational power at the project. Indeed, there was a lot of excitement about projects like this a few decades ago, but it seems to Prof. Cao that this energy has since died down, partly due to greater appreciation of their difficulty. Similarly, efforts to build an artificial cell have proven very difficult. At some level, cells are simple, and we basically know what the components are. However, all of the biochemical processes are poised in a delicate balance with each other – a balance that represents a vanishingly smaller percentage of all possible arrangements, and which is correspondingly difficult to replicate. Efforts to create functional brain simulations might run into similar problems. For example, it may be that the brain’s function depends on a particular type of relationship to the environment, which allows it to adjust and fine-tune its internal features in the right way” (p. 2).

435.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Eric Jonas: “many in the neuroscience community feel that some neuroscientists made overly aggressive claims in the past about what amount of progress in neuroscience to expect (for example, from simulating networks of neurons at a particular level of resolution)” (p. 5).

436.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Eric Jonas: “[Prof. Jonas] also has a long-term prior that researchers are too quick to believe that the brain is doing whatever is currently popular in machine learning, and he doesn’t think we’ve found the right paradigm yet” (p. 3). From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Erik De Schutter: “He is also wary of the history of comparing the brain to the latest engineering technology (e.g., a steam engine, a classical computer, now maybe a quantum computer)” (p. 4).

437.Two experts thought this unlikely. From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Adam Marblestone: “Dr. Marblestone thinks that the probability that the field of neuroscience rests on some very fundamental paradigm mistake is very low. We’re missing a unified explanation of behavior and intelligence, but the basic picture of neurons as modular elements with some sort of transfer function and some sort of (possibly complicated) learning rule, without some extreme amount of internal computation taking place inside the cell, seems fairly solid to Dr. Marblestone” (p. 7).

438.Thanks to Dr. Dario Amodei and Dr. Owain Evans for suggesting that I consider correlations between different routes to higher numbers.

439.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Markus Meister: “Synapses are noisy, and silicon isn’t; and the brain uses huge numbers of neurons to represent the same variable, probably because a single neuron can’t do it robustly. Prof. Meister expects that human-level AI systems will use methods more naturally suited to silicon devices. This would suggest compute estimates lower than what scaling up from the retina would suggest” (p. 4). See Miller (2018): “The key variables of a firing-rate model are the firing rates, which correspond to the average number of spikes per unit time of a subset of similarly responsive cells. This is in contrast to spiking models in which the key variables are the membrane potentials of individual cells” (p. 211). Eliasmith (2013): “Consequently, we can think of the 2D state space as a standard Cartesian space, where two values (x and y co-ordinates) uniquely specify a single object as compactly as possible. In contrast, the 100D vector specifies the same underlying 2D object, but it takes many more resources (i.e., values) to do so. If there was no uncertainty in any of these 100 values, then this would simply be a waste of resources. However, in the much more realistic situation where there is uncertainty (resulting from noise of receptors, noise in the channels sending the signals, etc.), this redundancy can make specifying an underlying point much more reliable. And, interestingly, it can make the system much more flexible in how well it represents different parts of that space. For example, we could use 10 of those neurons to represent the first dimension, or we could use 50 neurons to do so. The second option would give a much more accurate representation of that dimension than the first. Being able to redistribute these resources to respond to task demands is one of the foundations of learning (see Section 6.4)” (p. 75). From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Adam Marblestone: “One way you might need less than 1 FLOP per spike through synapse is if you don’t need to model all of the neurons in the brain. For example, it might be that all of the neurons and synapses in the brain are there in order to make the brain more likely to converge on a solution while learning, but that once learning has taken place, the brain implements a function that can be adequately approximated using much less compute. A large amount of neuroscience treats populations of neurons as redundant representations of high-level variables relevant to information-processing” (p. 7).

440.From the author summary: “A network in the brain consists of thousands of neurons. A priori, we expect that the network will have as many degrees of freedom as its number of neurons. Surprisingly, experimental evidence suggests that local brain activity is confined to a subspace spanned by ~10 variables” (p. 1). See also Gallego et al. (2017): “Here we argue that the underlying network connectivity constrains these possible patterns of population activity (Okun et al. (2015), Sadtler et al. (2014), Tsodyks et al. (1999)) and that the possible patterns are confined to a low-dimensional manifold (Stopfer et al. (2003), Yu et al. (2009)) spanned by a few independent patterns that we call ‘neural modes.’ These neural modes capture a significant fraction of population covariance. It is the activation of these neural modes, rather than the activity of single neurons, that provides the basic building blocks of neural dynamics and function (Luczak et al. (2015), Sadtler et al. (2014), Shenoy et al. (2013))” (p. 2).

441.My thanks to the expert who suggested I consider this.

442.Faisal et al. (2008): “Averaging is used in many neural systems in which information is encoded as patterns of activity across a population of neurons that all subserve a similar function (for example, see REFS 142,143): these are termed neural population codes. A distributed representation of information of this type is more robust to the effects of noise. Many sensory systems form a spatially-ordered population — that is, a map — in which neighbouring neurons encode stimuli that share closely related features. Such spatially ordered populations support two basic goals of neural computation: first, a transformation between different maps (such as the direction of sounds into neck rotation) and, second, the combination of information from multiple sources (such as visual- and auditory-cue combination)144. The information capacity of a population of neurons is greatest when the noise sources across the population are not correlated. Noise correlations, which are often observed in populations of higher-order neurons, limit information capacity and have led to the development of population-coding strategies that account for the effects of correlations” (p. 10).

443.See p. 10 here.

444.From here: “Michael Steil and some collaborators had ported the code to C and were able to run at about 1kHz… This was only a thousand times slower than the original, running on a computer that was perhaps two million times faster.” Other emulations may be more efficient.

445.Dr. Dario Amodei suggests considering whether we can leave out the cerebellum for certain types of tasks.

446.From the National Organization for Rare Disorders: “Additional reports have noted individuals with cerebellar agenesis whose mental capacities were unaffected and who did not exhibit any symptoms of cerebellar agenesis (asymptomatic cases). However, other researchers have disputed these claims, stating that in virtually all of cases of cerebellar agenesis there have been observable symptoms including profound abnormalities in motor skills…. Intelligence may be unaffected. However, some affected individuals may display mild to moderate cognitive impairment. Some individuals with cerebellar agenesis have exhibited intellectual disability, but normal or near-normal motor skills. In addition to affecting motor skills, damage to the cerebellum has also been associated with abnormalities of non-motor functions. Cerebellar dysfunction may also be associated with abnormalities of visuospatial abilities, expressive language, working memory and affective behavior.” Cases of cerebellar agenesis are described in a popular article by Hamilton (2015) and in Gelal et al. (2016). The case described in Hamilton (2015) seems to involve at least mild cognitive impairment: the subject described has trouble coordinating different sources of information, and he “needed to be taught a lot of things that people with a cerebellum learn automatically, Sarah [his sister] says: how to speak clearly, how to behave in social situations and how to show emotion.” The cases in Gelal et al. (2016) also appear to involve substantive cognitive impairment: “The 61-year-old man had ataxia, dysarthria, abnormalities in cerebellar tests, severe cognitive impairment, and moderate mental retardation. The 26-year-old woman had dysmetria, dysdiadochokinesia, and dysarthria as well as mild cognitive impairment and mild mental retardation” (abstract).

447.Swanson (1995) (p. 473).

448.Azevedo et al. (2009) (p. 536) suggests that the cerebellum weighs ~154.02 g (10.3% of the brain’s mass), whereas the cerebral cortex weighs 1232.93 g (81.8% of the brain’s mass).

449.I’m basing this on the fact that the cerebellum is ~10% of the brain’s weight, relative to ~80% for the cortex, and Howarth et al.’s (2012) suggestion that energy consumption per gram is higher in the cerebral cortex than in the cerebellar cortex: “Including this range of values would result in a range of estimates for total energy use for the cerebral cortex of 27.2 to 40.7 μmol ATP/g/min, compared with the measured total energy use of 33 to 50 μmol ATP/g/min in different cortical regions (Sokoloff et al. (1977)), and for the cerebellar cortex of 17.1 to 25.6 μmol ATP/g/min, compared with the measured value of 20.5 μmol ATP/g/min (Sokoloff et al. (1977)). Further work is needed to accurately define these parameters” (p. 1232). Sarpeshkar (1997): “Most of the power in the brain is consumed in the cortex” (p. 204). Thanks to Carl Shulman for suggesting that I consider cerebellar energy consumption, and for pointing me to references.

450.Most of the neurons in the cerebellum (specifically, about 50 billion, at least according to Llinás et al. (2004) (p. 277)) are cerebellar granule cells, which appear to have a comparatively small number of synapses each: “[Granule] cells are the most numerous in the CNS; there are about 5 × 10^10 cerebellar granule cells in the human brain. Each cell has four or five short dendrites (each less than 30 μm long) that end in an expansion called a dendritic claw (Fig. 7.4C)” (Llinás et al. (2004) (p. 277)). Wikipedia cites Llinás et al. (2004) as grounds for attributing 80-100 synaptic connections to granule cells, but I haven’t been able to find the relevant number. The cerebellum also contains Purkinje cells (up to 1.5e7, according to Llinás et al. (2004), p. 276), which can have over 100,000 synapses each, though I’m not sure about the average number (see Napper and Harvey (1988): “We conclude that there are some 175,000 parallel fiber synapses on an individual Purkinje cell dendritic tree in the cerebellar cortex of the rat” (abstract), though this is an old estimate). I have not attempted to estimate the synapses in the cerebellum in particular, and I am not sure to what extent synapse counts for granule cells and Purkinje cells overlap (a possibility that could lead to double counting). Energy use in the cerebellum appears to be dominated by granule cells: “This work predicts that the principal neurons in the cerebellum, the Purkinje cells, use only a small fraction of the energy consumed by the cerebellar cortex, while the granule cells dominate the signaling energy use” (Howarth et al. (2012), p. 1230-1231). Many estimates for total synapses in the brain focus on the cerebral cortex, and in particular the neocortex (see citations in Section 2.1.1.1), and AI Impacts reports the impression, which I share, that neocortical synapses are often treated as representing the bulk of the synapses in the brain. Indeed, Kandel et al. (2013) suggests that “10^14 to 10^15 synaptic connections are formed in the brain” (p. 175) – a number comparable to the neocortical estimates from Tang et al. (2001) (“The average total number of synapses in the neocortex of five young male brains was 164 × 10^12 (CV = 0.17)” (p. 258)) and Pakkenberg et al. (2003) (“The total number of synapses in the human neocortex is approximately 0.15 × 10^15 (0.15 quadrillion)” (p. 95)).
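
For a rough sense of scale, here is the arithmetic these figures imply, treating the per-cell counts as if they were exact (they are not, and, as noted, granule and Purkinje counts may partly overlap):

```python
# Illustrative cerebellar synapse arithmetic from the figures quoted above.
granule_cells = 5e10
granule_synapses = 80          # 80-100 per cell, per the Wikipedia attribution
purkinje_cells = 1.5e7
purkinje_synapses = 175_000    # Napper and Harvey (1988), rat

granule_total = granule_cells * granule_synapses      # ~4e12
purkinje_total = purkinje_cells * purkinje_synapses   # ~2.6e12; may double-count
                                                      # parallel fiber synapses
neocortex_total = 1.5e14       # Tang et al. (2001); Pakkenberg et al. (2003)

print(f"{granule_total:.1g} granule, {purkinje_total:.1g} Purkinje synapses")
print(f"granule synapses / neocortex synapses: {granule_total / neocortex_total:.1%}")
```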

451.For example, Pulsifer et al. (2004) report that in a study of 71 patients who underwent hemispherectomy for severe and intractable seizures, “Cognitive measures typically changed little between surgery and follow-up, with IQ change <15 points for 34 of 53 patients” (abstract) (though absolute levels of cognitive ability may still have been low), and Pavone et al. (2013) suggest that “The results obtained from the literature show that relative preservation of cognitive performance suggests that a single cerebral cortical hemisphere connected to an apparently intact brainstem is sufficient for the development of higher cognitive function” (p. 2). See also this article in the New Scientist, which reports that “a teenager who was born without the entire left hemisphere of her brain has above-average reading skills – despite missing the part of the brain that is typically specialised for language…The 18-year-old also has an average-to-high IQ and plans to go to university.”

452.Judging from one study, asymptomatic Alzheimer’s disease does not appear to be associated with neuron loss. See Andrade-Moraes et al. (2013): “We found a great reduction of neuronal numbers in the hippocampus and cerebral cortex of demented patients with Alzheimer’s disease, but not in asymptomatic subjects with Alzheimer’s disease” (abstract).

453.Dr. Dario Amodei suggested considering these constraints. See also the citations throughout the rest of the section.

454.Sandberg (2016): “Biology has many advantages in robustness and versatility, not to mention energy efficiency. Nevertheless, it is also fundamentally limited by what can be built out of cells with a particular kind of metabolism, the fact that organisms need to build themselves from the inside, and the need of solving problems that exist in a particular biospheric environment” (p. 7).

455.See Moravec (1988): “There is insufficient information in the 10^10 bits of the human genome to custom-wire many of the 10^14 synapses in the brain” (p. 166). See also Zador (2019): “The human genome has about 3 × 10^9 nucleotides, so it can encode no more than about 1 GB of information—an hour or so of streaming video. But the human brain has about 10^11 neurons, and more than 10^3 synapses per neuron. Since specifying a connection target requires about log2 10^11 = 37 bits/synapse, it would take about 3.7 × 10^15 bits to specify all 10^14 connections. (This may represent an underestimate because it considers only the presence or absence of a connection; a few extra bits/synapse would be required to specify graded synaptic strengths. But because of synaptic noise and for other reasons, synaptic strength may not be specified very precisely. So, in large and sparsely connected brains, most of the information is probably needed to specify the locations [of] the nonzero elements of the connection matrix rather than their precise value.) Thus, even if every nucleotide of the human genome were devoted to efficiently specifying brain connections, the information capacity would still be at least six orders of magnitude too small” (p. 5).
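
Zador’s back-of-the-envelope comparison is easy to reproduce; here is a minimal sketch of the published arithmetic (nothing here goes beyond the quote above):

```python
import math

genome_bits = 3e9 * 2                      # ~3e9 nucleotides at 2 bits each (~0.75 GB)
neurons = 1e11
synapses = neurons * 1e3                   # ~1e14 connections
bits_per_synapse = math.log2(neurons)      # ~37 bits to address one of 1e11 targets
wiring_bits = synapses * bits_per_synapse  # ~3.7e15 bits

print(f"genome: {genome_bits:.2g} bits; wiring: {wiring_bits:.2g} bits")
print(f"shortfall: {wiring_bits / genome_bits:.0e}")   # ~6e5, i.e. ~6 orders of magnitude
```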

456.Moravec (1988): “The slow switching speed and limited signaling accuracy of neurons rules out certain solutions for neural circuitry that are easy for computers” (p. 165). Dmitri Strukov’s comments here: “we should also keep in mind that over millions of years the evolution of biological brains has been constrained to biomaterials optimized for specific tasks, while we have a much wider range of material choices now in the context of neuromorphic engineering. Therefore, there could exist profound differences in designing rules. For example, the brains have to rely on poor conductors offered by biomaterials, which have presumably affected the principles of brain structure and operation in some ways that are not necessarily to be applicable to neuromorphic computing based on high conducting materials.”

457.Moravec (1988): “The neuron’s basic information-passing mechanism – the release of chemicals that affect the outer membranes of other cells – seems to be a very primitive one that can be observed in even the simplest free-swimming bacteria. Animals seem to be stuck with this arrangement because of limitations in their design process. Darwinian evolution is a relentless optimizer of a given design, nudging the parameters this way and that, adding a step here, removing one there, in a plodding, tinkering, way. It’s not much of a redesigner, however. Fundamental changes at the foundation of its creations are out of reach, because too many things would have to change correctly all at once” (p. 168).

458.Here, the distinction between “finding ways to do it the way the brain does it, but with a high-level of simplification/increased efficiency” and “doing it some other way entirely” is blurry. I have the former vaguely in mind, but see the appendix for more detailed discussion. See also Sandberg (2016) for more discussion of possible constraints: “While we have reason to admire brains, they are also unable to perform certain very useful computations. In artificial neural networks we often employ non-local matrix operations like inversion to calculate optimal weights (Toutounian and Ataei (2009)): these computations are not possible to perform locally in a distributed manner. Gradient descent algorithms such as backpropagation are unrealistic in a biological sense, but clearly very successful in deep learning. There is no shortage of papers describing various clever approximations that would allow a more biologically realistic system to perform similar operations — in fact, the brains may well be doing it — but artificial systems can perform them directly, and by using low-level hardware intended for it, very efficiently” (p. 7).

459.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Markus Meister: “The computations performed in the retina are fairly well-understood. There is more to learn, of course, but the core framework is in place. We have a standard model of the retina that can account for a lot of retinal processing, as well as predict new observations… The retina is probably the best understood part of the brain” (p. 1-2).

460.See Yue et al. (2016) for a review of progress in retinal implant development as of 2016. From the Stanford Artificial Retina Project: “The current state of the art of retinal prostheses can be summed up as such: no blind patient today would trade their cane or guide dog for a retinal implant.” From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Markus Meister: “Despite 30 years of effort, attempts to create functional artificial retinas have met with very little success. Recent performance tests show that people implanted with the devices are functionally blind – e.g., they cannot read, and they cannot distinguish between letters unless the letters occupy the entire visual field” (p. 3). Nirenberg and Pandarinath (2012) say: “Current devices still provide only very limited vision. For example, they allow patients to see spots of light and high-contrast edges, which provide some ability for navigation and gross feature detection, but they are far from providing patients with normal representations of faces, landscapes, etc. (4–6). [With respect to navigation, the devices enable the detection of light sources, such as doorways and lamps, and, with respect to feature detection, they allow discrimination of objects or letters if they span ∼7° of visual angle (5); this corresponds to about 20/1,400 vision; for comparison, 20/200 is the acuity-based legal definition of blindness in the United States (7)]” (p. 15012), though their paper aims to improve the situation.

461.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Markus Meister: “However, this lack of success is not about computation. People in the field generally agree that if you could make the right kind of one-to-one connection to the optic nerve fibers, you could compute spike trains that would allow the brain to see. The obstacle is actually making the interface between an electrical device and the retina. Electrodes on top of the retina stimulate many nerve fibers at once; you don’t know ahead of time which fiber you’ll be stimulating or what type of retinal ganglion cell you’re connected to, and you can’t get data into the eye at the right rate” (p. 3).

462.See Moravec (1988), Chapter 2 (p. 51-74). See also Moravec (1998) and Moravec (2008). Merkle (1989) uses a broadly similar methodology.

463.See Moravec (1988) (p. 57-60). For discussion of what a center-surround and a motion-detection operation in the retina consists in, see Meister et al. (2013): “A typical ganglion cell is sensitive to light in a compact region of the retina near the cell body, called the cell’s receptive field. Within that area one can often distinguish a center region and surround region in which light produces opposite responses. An ON cell, for example, fires faster when a bright spot shines on the receptive field’s center but decreases its firing when the spot shines on the surround. If light covers both the center and the surround, the response is much weaker than for center-only illumination. A bright spot on the center combined with a dark annulus on the surround elicits very strong firing. For an OFF cell these relationships are reversed; the cell is strongly excited by a dark spot in a bright annulus (Figure 26-10). The output produced by a population of retinal ganglion cells thus enhances regions of spatial contrast in the input, such as an edge between two different areas of different intensity, and gives less emphasis to regions of homogeneous illumination” (p. 587). See Meister et al. (2013) (p. 588-589), and this graphic, for visual depictions of center-surround type responses. With respect to retinal representation of moving objects, Meister et al. (2013) write: “When an effective light stimulus appears, a ganglion cell’s firing typically increases sharply from the resting level to a peak and then relaxes to an intermediate rate. When the stimulus turns off, the firing rate drops sharply then gradually recovers to the resting level… a moving object elicits strong firing in the ganglion cell population near the edges of the object’s image because these are the only regions of spatial contrast and the only regions where the light intensity changes over time” (p. 587, see p. 588-589 for more on motion-detection).

464.See Moravec (1988) (p. 58-59). That said, he also acknowledges that “though separate frames cannot be distinguished faster than 10 per second, if the light flickers at the frame rate, the flicker itself is detectable until it reaches a frequency of about 50 flashes per second” (p. 59).

465.See Gollisch and Meister (2010): “When the image of an object moves on the retina, it creates a wave of neural activity among the ganglion cells. One should expect that this wave lags behind the object image because of the delay in phototransduction. Instead, experiments show that the activity in the ganglion cell layer moves at the true location of the object or even along its leading edge (Berry et al. (1999)). Effectively, the retinal network computes the anticipated object location and thereby cancels the phototransduction delay” (p. 7-8).

466.See Gollisch and Meister (2010): “A somewhat different form of anticipation can be observed when the visual system is exposed to a periodic stimulus, such as a regular series of flashes. The activated visual neurons typically become entrained into a periodic response. If the stimulus sequence is interrupted, for example by omitting just one of the flashes, some neurons generate a pulse of activity at the time corresponding to the missing stimulus (Bullock et al. (1990); Bullock et al. (1994)). This phenomenon, termed the “omitted stimulus response”, is quite widespread, and has been noted in the brains of many species, including humans (McAnany and Alexander (2009)). Qualitatively it suggests the build-up of an anticipation for the next stimulus, and the large response reflects surprise at the missing element in the sequence” (p. 7-8).

467.Gollisch and Meister (2010): “Because the ambient light level varies over ~9 orders of magnitude in the course of a day, while spiking neurons have a dynamic range of only ~2 log units, the early visual system must adjust its sensitivity to the prevailing intensities. This adaptation to light level is accomplished by the retina, beginning already in the photoreceptors, and the process is complete before spiking neurons get involved. Over a wide range of intensities, the sensitivity of the retina declines inversely with the average light level. As a result, the ganglion cell signals are more or less independent of the illuminating intensity, but encode the reflectances of objects within the scene, which are the ethologically important variables. The perceptual effects of light adaptation and its basis in the circuitry and cellular mechanisms of the retina have been studied extensively and covered in several excellent reviews (Shapley and Enroth-Cugell (1984); Hood (1998); Fain et al. (2001); Rieke and Rudd (2009))” (p. 11).

468.Gollisch and Meister (2010): “During a saccade, the image sweeps across the retina violently for tens of milliseconds, precluding any useful visual processing. In humans, visual perception is largely suppressed during this period (Volkmann (1986); Burr et al. (1994); Castet and Masson (2000)). The circuits of the retina are at least partly responsible for this suppression: Many types of retinal ganglion cell are strongly inhibited during sweeps of the visual image (Roska and Werblin (2003)). This effect is mediated by spiking, inhibitory amacrine cells, which are themselves excited by the global motion signal. Conceivably, the underlying circuitry resembles the one identified for OMS ganglion cells (Figure 2C). In fact, the OMS cells may be distinct simply by an enhanced sensitivity to the global inhibition, so they are suppressed even by the much smaller eye movements during a fixation” (p. 9).

469.Gollisch and Meister (2010): “The anatomical diversity suggests that there is much function left to be discovered and that we probably still have a good distance to go before understanding all the computations performed by the retina” (p. 14).

470.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Markus Meister: “It has taken more effort to simulate retinal responses to natural scenes than to artificial stimuli used in labs (e.g. spots, flashes, moving bars)” (p. 1). Heitman et al. (2016): “This paper tests how accurately one pseudo-linear model, the generalized linear model (GLM), explains the responses of primate RGCs to naturalistic visual stimuli … The GLM accurately reproduced RGC responses to white noise stimuli, as observed previously, but did not generalize to predict RGC responses to naturalistic stimuli. It also failed to capture RGC responses when fitted and tested with naturalistic stimuli alone. Fitted scalar nonlinearities before and after the linear filtering stage were insufficient to correct the failures. These findings suggest that retinal signaling under natural conditions cannot be captured by models that begin with linear filtering, and emphasize the importance of additional spatial nonlinearities, gain control, and/or peripheral effects in the first stage of visual processing” (p. 1).

471.See Figure 1C in Maheswaranathan et al. (2019), and Batty et al. (2017): “RNNs of varying architectures consistently outperformed LNs and GLMs in predicting neural spiking responses to a novel natural scene movie for both OFF and ON parasol retinal ganglion cells in both experiments (Figure 2)” (p. 6).

472.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. E.J. Chichilnisky: “It’s hard to know when to stop fine-tuning the details of your model. A given model may be inaccurate to some extent, but we don’t know whether a given inaccuracy matters, or whether a human wouldn’t be able to tell the difference (though focusing on creating usable retinal prostheses can help with this)” (p. 3).

473.From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Stephen Baccus: “The visual system works under a wide range of conditions – for example, varying light levels and varying contrast levels. Experiments focused on a set of natural scenes only cover some subset of these conditions. For example, Prof. Baccus’s lab has not really tested dim light, or rapid transitions between bright and dim light” (p. 2). From Open Philanthropy’s non-verbatim notes from a conversation with Prof. E.J. Chichilnisky: “One of the biggest challenges is the world of possible stimuli. It would take lifetimes to present all possible stimuli, so we don’t know if we’re missing something. Prof. Chichilnisky’s lab has the biggest trove of data in the world from retinal ganglion cells. They’ve recorded from something like 500,000 retinal ganglion cells (roughly half the retina), and they have about 50 billion spikes. But even this may not be enough data” (p. 3).

474.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Markus Meister: “The biochemistry involved in retinal light adaptation is well-understood, and it can be captured using a simplified computational model. Specifically, you can write down a three-variable dynamical model that gets it about 80% correct. The compute required to run a functional model of the retina would probably be dominated by the feedforward processing in the circuit, rather than by capturing adaptation” (p. 2).

475.From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Stephen Baccus: “These models focus on replicating the response of an individual retinal ganglion cell to a stimulus. However, it may also be necessary to replicate correlations between the responses of different cells in the retina, as these may carry important information. Some people think that replicating the firing patterns of individual cells is enough, but most people think that correlations are important. Prof. Baccus’s lab has not yet assessed their model’s accuracy with respect to these between-cell correlations, though it is on their agenda” (p. 2).

476.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. E.J. Chichilnisky: “There is variability in retinal function both across species and between individuals of the same species. Mouse retinas are very different from human retinas (a difference that is often ignored), and there is variability amongst monkey retinas as well” (p. 3).

477.For example, there are about 20 different types of retinal ganglion cells in humans (see Open Philanthropy’s non-verbatim notes from a conversation with Prof. E.J. Chichilnisky (p. 3)), which could vary in complexity. However, Prof. Stephen Baccus seemed to think that the data gathered for Maheswaranathan et al. (2019) captures this complication. From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Stephen Baccus: “There is no special selection involved in choosing which cells to test, and Prof. Baccus would expect similar success with arbitrary sets of retinal ganglion cells, though one cannot account for every cell under every condition without testing it” (p. 1). Another possibility is that these CNNs/RNNs might be vulnerable to adversarial examples, in a manner analogous to the vulnerabilities exhibited by image recognition systems (see discussion in Section 3.2). And the results were obtained using isolated retinas (I believe this means that the animal’s eyes were removed from the body), which could introduce differences as well.

478.From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Stephen Baccus: “Prof. Baccus and his colleagues have calculated that their CNN requires ~20 billion floating point operations to predict the output of one ganglion cell over one second (these numbers treat multiply and addition as separate operations – if we instead counted multiply-add operations (MACCs), the numbers would drop by a factor of roughly 2). The input size is 50 × 50 (pixels) × 40 time points (10 ms bins). Layer 1 has 8 channels and 36 × 36 units with 15 × 15 filters each. Layer 2 has 8 channels and 26 × 26 units with 11 × 11 filters each. Layer 3 (to the ganglion cell) is a dense layer with a 8 × 26 × 26 filter from layer 2. This leads to the following calculation for one ganglion cell:

Layer 1: (40 × 15 × 15 × 2 + 1 (for the ReLU)) × 36 × 36 units × 8 channels = 1.87e8

Layer 2: (8 × 11 × 11 × 2 + 1) × 26 × 26 units × 8 channels = 1.05e7

Layer 3: 8 × 26 × 26 × 2 = 10,816.

Total: 1.97e8 FLOP per 10 ms bin. Multiplied by 100, this equals 1.97e10 FLOP/s” (p. 6).
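For readers who want to check the arithmetic, the calculation in the note above can be reproduced directly. A minimal Python sketch (the layer shapes and the two-FLOPs-per-MACC convention come from the note; the variable names are mine):

```python
# Per-ganglion-cell FLOP count for the CNN described in the note above.
# Convention (from the note): one multiply-accumulate = 2 FLOPs; +1 FLOP per ReLU.

# Layer 1: 8 channels of 36 x 36 units, each applying a 40 x 15 x 15 filter.
layer1 = (40 * 15 * 15 * 2 + 1) * 36 * 36 * 8   # ~1.87e8 FLOPs

# Layer 2: 8 channels of 26 x 26 units, each applying an 8 x 11 x 11 filter.
layer2 = (8 * 11 * 11 * 2 + 1) * 26 * 26 * 8    # ~1.05e7 FLOPs

# Layer 3: dense readout to one ganglion cell (8 x 26 x 26 weights).
layer3 = 8 * 26 * 26 * 2                        # 10,816 FLOPs

per_bin = layer1 + layer2 + layer3              # ~1.97e8 FLOPs per 10 ms bin
per_second = per_bin * 100                      # 100 bins per second
print(f"{per_bin:.3g} FLOPs per bin; {per_second:.3g} FLOP/s")  # ~1.97e10 FLOP/s
```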

479.From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Stephen Baccus: “Simulating more ganglion cells simultaneously only alters the last layer of the network, and so results in only a relatively small increase in computation. A typical experiment involves around 5-15 cells, but Prof. Baccus can easily imagine scaling up to 676 cells (26 × 26 — the size of the last layer), or to 2500 (50×50 — the size of the input). 676 cells would require 20.4 billion FLOPs per second. 2500 would require 22.4 billion.” (p. 6). 22.4 billion/2500 is ~9e6, which I’ve rounded to 1e7.

480.My estimate is as follows. 1st layer: (31 × 31 (image patch) + 50 (inputs from previous time-step)) × 50 = 50,550 MACCs. Second layer: (50 feedforward inputs from layer 1 + 50 inputs from previous time-step) × 50 = 5,000 MACCs. Total MACCs per time-step: ~55,600. Multiplied by two for FLOPs vs. MACCs (see “It’s dot products all the way down” here) = ~111,000 FLOPs per time-step. Time-steps per second: 1200 (0.83 ms time bins). Total FLOPs per cell per second: ~1.3e8 FLOP/s. I have discussed this estimate with two people with ML expertise, but it has not been confirmed by any of the paper’s authors.
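A minimal sketch of this estimate in Python (my own reconstruction, using only the figures above; it has the same unconfirmed status as the estimate itself):

```python
# Rough FLOP/s estimate for the two-layer recurrent model described above.
# Convention: one multiply-accumulate (MACC) = 2 FLOPs.

layer1_maccs = (31 * 31 + 50) * 50   # image patch + recurrent inputs, 50 units
layer2_maccs = (50 + 50) * 50        # feedforward + recurrent inputs, 50 units
maccs_per_step = layer1_maccs + layer2_maccs     # ~55,600 MACCs per time-step
flops_per_step = 2 * maccs_per_step              # ~111,000 FLOPs per time-step
steps_per_second = 1200                          # 0.83 ms time bins
print(f"~{flops_per_step * steps_per_second:.2g} FLOP/s per cell")  # ~1.3e8
```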

481.Sarpeshkar (2010) estimates at least 1e10 FLOP/s for the retina, based on budgeting at least one floating-point multiplication operation per synapse, and a 12 Hz rate of computation (p. 749). However, he doesn’t (at least in that paragraph) say much to justify this assumption; and estimates that assume 1 FLOP per event at synapses have already been covered, to some extent, under the mechanistic method section. So I’ll focus elsewhere. For what it’s worth, though, Sarpeshkar’s (2010) estimate would imply at least ~1e13-1e16 FLOP/s for the brain as a whole, using the scaling factors discussed below.

482.From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Stephen Baccus: “The largest amount of computation takes place in the first layer of the network. If the input size was larger, these numbers would scale up” (p. 6).

483.Moravec (2008) reports that the brain is about 75,000 times heavier than the retina, which he cites as weighing 0.02 g. He rounds this factor to 100,000, which in combination with his 1e9 calculations per second estimate for replicating the retina yields a whole-brain estimate of 1e14 calculations per second. See Moravec (2008), “Nervous Tissue and Computation.” Azevedo et al. (2009) (p. 536) report that the whole brain weighs ~1508.91 g, in line with what Moravec’s factor implies (~1500 g). However, Sarpeshkar (2010) (p. 748) estimates retinal weight at 0.4 g, substantially more than Moravec’s 0.02 g, which would yield a weight-based scale-up of only ~3750 and hence a whole-brain estimate of ~4e12 calculations per second.

484.Moravec (1988): “The 1,500 cubic centimeter human brain is about 100,000 times as large as the retina” (p. 2). Sarpeshkar (2010) (p. 748), reports that the area of the human retina is 2500 mm², and the average thickness is 160 µm, for a total of 400 mm³ (0.4 cm³). The brain appears to be around 1400 cm³, which suggests a scale-up, on Sarpeshkar’s numbers, of ~3500.

485.The retina has about 1e8 signaling cells if you include all the photoreceptors (though Stephen Baccus indicated that for bright light, it might make more sense to focus on the roughly 5e6 cones), and tens of millions of other non-photoreceptor neurons. These numbers are roughly a factor of 1000 and 10,000 less, respectively, than the brain’s neuron count (1e11). From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Stephen Baccus: “We can think of the retina as receiving a 100 megapixel input and outputting a 1 megapixel output (though in bright light, it’s more like 5 million inputs, because there are 5 million cones and 95 million rods). And there are something like 10 million other cells in the retina” (p. 3).

486.Sarpeshkar (2010) (p. 698), lists ~1 billion synapses in the retina, though I’m not sure where he got this number. I am assuming the synapse estimates of 1e14-1e15, discussed in Section 2.1.1.1.

487.See Sarpeshkar (2010): “The weight of the human retina is 2500 mm² (area) × 160 µm (avg. thickness) × 1000 kg/m³ (density in SI units) = 0.4 grams. Thus, the power consumption of human rods in the dark may be estimated to be 0.2 grams × 13 µmol ATP/g/min × 20 kT/ATP = 2.1 mW. If we assume that outer retina power consumption is dominated by the rods, and that the inner and outer retina consume at the same rate in humans, then the total power consumption of the retina in the dark may be estimated to be 2.1 mW × 2 = 4.2 mW. We list the average of (2.6 + 4.2)/2 = 3.4 mW as our estimate for the total power consumption of the retina in Table 23.2. We thank Simon Laughlin for his generous assistance in helping us estimate the number of synapses in the retina and the power consumption of the eye” (p. 748). Following Sarpeshkar, I am here using Aiello’s (1997) estimate of 14.6 W for the brain as a whole.
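Taken together, the figures in the last few notes imply a family of retina-to-brain scale-up factors. A minimal Python sketch collecting them (all inputs are the numbers cited above; the layout is mine):

```python
# Retina-to-brain scale-up factors implied by the figures cited above.
factors = {
    "weight, Moravec (1500 g / 0.02 g)":          1500 / 0.02,    # ~75,000
    "weight, Sarpeshkar (1500 g / 0.4 g)":        1500 / 0.4,     # ~3,750
    "volume (1400 cm^3 / 0.4 cm^3)":              1400 / 0.4,     # ~3,500
    "neurons, incl. photoreceptors (1e11 / 1e8)": 1e11 / 1e8,     # ~1,000
    "neurons, excl. photoreceptors (1e11 / 1e7)": 1e11 / 1e7,     # ~10,000
    "synapses, low end (1e14 / 1e9)":             1e14 / 1e9,     # ~1e5
    "synapses, high end (1e15 / 1e9)":            1e15 / 1e9,     # ~1e6
    "energy (14.6 W / 3.4 mW)":                   14.6 / 3.4e-3,  # ~4,300
}
for name, factor in factors.items():
    print(f"{name}: ~{factor:,.0f}x")

# Multiplying a retina estimate by one of these factors gives a whole-brain
# figure: e.g., Moravec's 1e9 calc/s x 1e5 = 1e14 calc/s, and Sarpeshkar's
# 1e10 FLOP/s x 1e3-1e6 = 1e13-1e16 FLOP/s (as in note 481).
```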

488.Moravec (1988): “The retina’s evolutionarily pressed neurons are smaller and more tightly packed than average” (p. 59). See also Moravec’s (3/18/98) replies to Anders Sandberg’s comment in the Journal of Evolution and Technology: “Evolution can just as easily choose two small neurons as one twice as large. The cost in metabolism and materials is the same. So I would expect brain structures to maximize for effective computation per volume, not per neuron. After all, one neuron with ten thousand synapses might be the computational match of 50 neurons with 50 synapses each.”

489.See his reply to Moravec here: “volume cannot be compared due to the differences in tissue structure and constraints.”

490.See his reply to Moravec here. Though his high-end estimate of whole brain neuron count (1e12) is, I think, too large.

491.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. E.J. Chichilnisky: “The brain is probably a lot more plastic than the retina, though this is likely a quantitative rather than a qualitative difference” (p. 4).

492.See Anders Sandberg’s 1998 comments on Moravec: “The retina is a highly optimized and fairly stereotypal neural structure, this can introduce a significant bias.”

493.For example, it needs to be packed into the eye, and to be transparent enough for light signals to pass through layers of cells to reach the photoreceptors. Anders Sandberg, in his 1998 comments on Moravec, also suggests that it needs to be two-dimensional, which might preclude more interesting and complex computational possibilities available to 3D structures. I have not investigated this.

494.From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Stephen Baccus: “There is higher connectivity in the cortex than in the retina… Recurrence might be the trickiest difference. The retina can be largely approximated as a feedforward structure (there is some feedback, but a feedforward model does pretty well), but in the cortex there is a lot of feedback between different brain regions. This might introduce oscillations and feedback signals that make precise details about spike timings (e.g., at a 1 ms level of precision) more important, and therefore make firing rate models, which blur over 10 ms, inadequate” (p. 5).

495.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. E.J. Chichilnisky: “We are much further along in mapping all of the cell types in the retina than we are in the brain as a whole. Differences between cell types matter a lot in the retina. We don’t know how much these differences matter in the rest of the brain. Some people think that they don’t matter very much, but Prof. Chichilnisky disagrees, and certainly the field has been moving in the direction of emphasizing the cell type differences in the brain. However, there’s no reason to think that some neuron types in the brain/retina will be radically simple and some will be radically complicated. There will be some variations, but perhaps not a big gulf” (p. 4).

496.The retina engages in certain forms of dendritic computation (see e.g. Taylor et al. (2000) and Hanson et al. (2019)), but various dendritic computation results focus on cortical pyramidal cells, and in particular on the apical dendrite of such cells (see London and Häusser (2005) for examples). Glia, electrical synapses, and neuropeptide signaling are all present in the retina; I’m less sure about ephaptic effects (to the extent that they’re present/task-relevant anywhere).

497.See his reply to Anders Sandberg here. Drexler (2019) assumes something similar: “In the brain, however, typical INA [immediate neural activity] per unit volume is presumably less than that of activated retina” (p. 188).

498.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Markus Meister (p. 4):

There is nothing particularly simplistic about the retina, relative to other neural circuits. It probably has a hundred different cell types, it probably uses almost every neurotransmitter we know of, and it has very intricate microcircuitry. Prof. Meister would be sympathetic to scaling up from the retina as a way of putting an upper limit on the difficulty of simulating the brain as a whole. Prof. Meister has not actually done this back-of-the-envelope calculation, but budgeting based on the rate at which action potentials arrive at synapses, multiplied by the number of synapses, seems like roughly the right approach.

Though see later in that section for some small increases (2×) for dendritic computation. From Open Philanthropy’s non-verbatim notes from a conversation with Prof. E.J. Chichilnisky (p. 4):

The level of modeling detail necessary in the retina provides a good test of the level of modeling detail necessary in the brain as a whole. However, the data on the retina aren’t in, and they won’t be in for a while.

From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Stephen Baccus (p. 5):

Prof. Baccus thinks the answer is ‘maybe’ to the question of whether the compute necessary to model neurons in the retina will be similar to the compute necessary to model neurons in the cortex. You might expect a volume by volume comparison to work as a method of scaling up from the retina to the cortex.

499.See Open Philanthropy’s non-verbatim notes from a conversation with Prof. Barak Pearlmutter: “Prof. Hans Moravec attempted to derive estimates of the computational capacity of the brain from examination of the retina. Prof. Pearlmutter thought that Moravec’s estimates for the computational costs of robotic vision were likely accurate, given Moravec’s expertise in vision” (p. 3).

500.See here: “Let’s say the input shape for a convolutional layer is 224×224×3, a typical size for an image classifier.” Other input sizes listed here.

501.This section is inspired by some arguments suggested by Dr. Dario Amodei, to the effect that ML vision models might be put into productive comparison with parts of the visual cortex (and in particular, conservatively, V1). See also Drexler (2019), who inspired some of Dr. Amodei’s analysis.

502.Some datasets have larger numbers of categories. For example, the full ImageNet dataset has 21k classes, and JFT-300M has 18,291 classes. However, many results focus on the benchmark set by the ILSVRC competition, which uses 1000 classes. I’ll focus there as well.

503.When asked to provide five labels for a given image, at least one human has managed to include the true label 94.9% of the time. Russakovsky et al. (2014): “Annotator A1 evaluated a total of 1500 test set images. The GoogLeNet classification error on this sample was estimated to be 6.8% (recall that the error on full test set of 100,000 images is 6.7%, as shown in Table 7). The human error was estimated to be 5.1%.” You can try out the task for yourself here. Karpathy (2014b), who appears to have served as Annotator A1 for Russakovsky et al. (2014), writes in a blog post: “There have now been several reported results that surpass my 5.1% error on ImageNet. I’m astonished to see such rapid progress. At the same time, I think we should keep in mind the following: Human accuracy is not a point. It lives on a tradeoff curve. We trade off human effort and expertise with the error rate: I am one point on that curve with 5.1%. My labmates with almost no training and less patience are another point, with even up to 15% error. And based on some calculations that consider my exact error types and hypothesizing which ones may be easier to fix than others, it’s not unreasonable to suggest that an ensemble of very dedicated expert human labelers might push this down to 3%, with about 2% being an optimistic error rate lower bound.” DNNs are worse at top-1 labeling, but my understanding is that this is partly because images contain multiple possible labels (see Kostyaev (2016)).

504.See Brownlee (2019b) for a breakdown of different types of object-recognition tasks, and here for example models. Hossain et al. (2018) review different image captioning models.

505.Cadena et al. (2019): “Despite great efforts over several decades, our best models of primary visual cortex (V1) still predict spiking activity quite poorly when probed with natural stimuli, highlighting our limited understanding of the nonlinear computations in V1” (abstract). See also Zhang et al. (2019): “While CNN models, especially those goal-driven ones pre-trained on computer vision tasks, performed very well in our study and some other studies (Cadena et al. (2017)) for V1 neuron modeling, we should point out that even the best-performing CNN in our study only explained about 50% of the explainable variance in our neural data, consistent with Cadena et al. (2017). The failure of CNN models for explaining the other half of the variance in V1 data can be due to a number of reasons. First, V1 neurons are subject to network interaction and their neural responses are known to be mediated by strong long-range contextual modulation. Second, it is possible that there are some basic structural components missing in the current deep CNN methodology for fully capturing V1 neural code” (p. 51-52 in the published version).

506.See Zhang et al. (2019), Kriegeskorte (2015), Yamins and DiCarlo (2016), and Lindsay (2020) for reviews.

507.Cadena et al. (2019): “We both trained CNNs directly to fit the data, and used CNNs trained to solve a high-level task (object categorization). With these approaches, we are able to outperform previous models and improve the state of the art in predicting the responses of early visual neurons to natural images” (see “Author summary”) … “We compared the models for a number of cells selected randomly (Fig 8A). There was a diversity of cells, both in terms of how much variance could be explained in principle (dark gray bars) and how well the individual models performed (colored bars). Overall, the deep learning models consistently outperformed the two simpler models of V1. This trend was consistent across the entire dataset (Fig 8B and 8D). The LNP model achieved 16.3% FEV [Fraction of explainable variance explained], the GFB model 45.6% FEV. The performance of the CNN trained directly on the data was comparable to that of the VGG-based model (Fig 8C and 8D); they predicted 49.8% and 51.6% FEV, respectively, on average” (p. 11). See also Zhang et al. (2019) for comparable results, and Klindt et al. (2017) and Antolík et al. (2016) for earlier results. Kindel et al. (2019) report that “we trained deep convolutional neural networks to predict the firing rates of V1 neurons in response to natural image stimuli, and we find that the predicted firing rates are highly correlated (CCnorm = 0.556 ± 0.01) with the neurons’ actual firing rates over a population of 355 neurons. This performance value is quoted for all neurons, with no selection filter. Performance is better for more active neurons: When evaluated only on neurons with mean firing rates above 5 Hz, our predictors achieve correlations of CCnorm = 0.69 ± 0.01 with the neurons’ true firing rates” (see abstract). I’m not sure how this fits with the characterization of the state of the art in Cadena et al. (2019).

508.Yamins et al. (2014): “We found that the top layer of the high-performing HMO model achieves high predictivity for individual IT neural sites, predicting 48.5±1.3% of the explainable IT neuronal variance (Fig. 3 B and C). This represents a nearly 100% improvement over the best comparison models and is comparable to the prediction accuracy of state-of-the-art models of lower-level ventral areas such as V1 on complex stimuli (10). In comparison, although the HMAX model was better at predicting IT responses than baseline V1 or SIFT, it was not significantly different from the V2-like model” … Schrimpf et al. (2018): “The models from this early work outlined above outperformed all other neuroscience models at the time and yielded reasonable scores on predicting response patterns from both single unit activity and fMRI.” And Yamins and DiCarlo (2016): “It turned out that the top hidden layers of these models were the first quantitatively accurate image-computable model of spiking responses in IT cortex, the highest-level area in the ventral hierarchy (Fig. 2b,c). Similar models have also been shown to predict population aggregate responses in functional MRI data from human IT (Fig. 2d)” (p. 359). Yamins and DiCarlo (2016) also note that “These results are not trivially explained merely by any signal reflecting object category identity being able to predict IT responses. In fact, at the single neuron level, IT neural responses are largely not categorical, and ideal-observer models with perfect access to category and identity information are far less accurate IT models than goal-driven HCNNs (Fig. 2a,c). Being a true image-computable neural network model appears critical for obtaining high levels of neural predictivity. In other words: combining two general biological constraints—the behavioral constraint of the object recognition task and the architectural constraint imposed by the HCNN model class—leads to greatly improved models of multiple layers of the visual sensory cascade” (p. 359). Schrimpf et al. (2018): “Current models still fall short of reaching benchmark ceilings: The best ANN model V4 predictivity score is 0.663, which is below the internal consistency ceiling of these V4 data (0.892). The best ANN model IT predictivity score is 0.604, which is below the internal consistency ceiling of these IT data (0.817). And the best ANN model behavioral predictivity score is 0.378, which is below the internal consistency ceiling of these behavioral data (0.497)” (p. 7). That said, I am not sure exactly what the relevant benchmark is in the context of this paper. See here for ongoing evaluation of the “brain-score” of different models – evaluation which incorporates the degree to which they predict neuron responses in IT.

509.Yamins et al. (2014): “We found that the HMO model’s penultimate layer is highly predictive of V4 neural responses (51.7±2.3% explained V4 variance), providing a significantly better match to V4 than either the model’s top or bottom layers. These results are strong evidence for the hypothesis that V4 corresponds to an intermediate layer in a hierarchical model whose top layer is an effective model of IT” (p. 8623). See also Bashivan et al. (2019): “We found that the neural predictor models correctly predicted 89% of the explainable (i.e., image-driven) variance in the V4 neural responses” (p. 1).

510.Khaligh-Razavi and Kriegeskorte (2014): “The models include well-known neuroscientific object-recognition models (e.g. HMAX, VisNet) along with several models from computer vision (e.g. SIFT, GIST, self-similarity features, and a deep convolutional neural network). We compared the representational dissimilarity matrices (RDMs) of the model representations with the RDMs obtained from human IT (measured with fMRI) and monkey IT (measured with cell recording) for the same set of stimuli (not used in training the models). Better performing models were more similar to IT in that they showed greater clustering of representational patterns by category. In addition, better performing models also more strongly resembled IT in terms of their within-category representational dissimilarities” (abstract). Yamins and DiCarlo (2016): “… Similar models have also been shown to predict population aggregate responses in functional MRI data from human IT (Fig. 2d)” (p. 359). See also Storrs et al. (2020).

511.See Yamins and DiCarlo (2016): “HCNN models that are better optimized to solve object categorization produce hidden layer representations that are better able to predict IT neural response variance” (Figure 2a, p. 360); and Schrimpf et al. (2018): “Extending prior work, we found that gains in ANN ImageNet performance led to gains on Brain-Score. However, correlation weakened at ≥ 70% top-1 ImageNet performance, suggesting that additional guidance from neuroscience is needed to make further advances in capturing brain mechanisms” (p. 1). See also http://www.brain-score.org/ for more data.

512.Yamins et al. (2014): “For example, neurons in the lowest area, V1, are well described by Gabor-like edge detectors that extract rough object outlines.” Olah et al. (2020b): “Gabor filters are a simple edge detector, highly sensitive to the alignment of the edge. They’re almost universally found in the fist [sic] layer of vision models.” They report that 44% of the units in the first conv layer of InceptionV1 are Gabor filters, and that 14% of the units in conv2d1 are “complex gabor filters,” which are “like Gabor Filters, but fairly invariant to the exact position, formed by adding together multiple Gabor detectors in the same orientation but different phases. We call these ‘Complex’ after complex cells in neuroscience” (see section “conv2d1”).

513.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Konrad Kording: “There is a traditional view in systems neuroscience that each brain area does something pre-assigned and simple. E.g., V1 detects edges, V4 pulls out colors and curvature, etc. But this view is dying at the moment” (p. 3). See also Roe et al. (2020): “One advanced shape property represented in V4 is curvature. Curvature, which can be considered an integration of oriented line segments, is a prominent feature of object boundaries. V4 cells (receptive fields typically 2–10 deg in size) can be strongly selective for curvature of contours (Pasupathy and Connor (1999, 2001)) as well as curved (i.e., non-Cartesian) gratings (Gallant et al. (1993, 1996))” (abstract); and Walsh (1999) for more on color in the visual cortex.

514.See Olah et al. (2020a): “Curve detecting neurons can be found in every non-trivial vision model we’ve carefully examined” (see Example 1: Curve Detectors). See also the corners in conv2d2 described in Olah et al. (2020b), and the color detectors described in conv2d0-2.

515.Bashivan et al. (2019): “Using an ANN-driven image synthesis method, we found that luminous power patterns (i.e., images) can be applied to primate retinae to predictably push the spiking activity of targeted V4 neural sites beyond naturally occurring levels. This method, although not yet perfect, achieves unprecedented independent control of the activity state of entire populations of V4 neural sites, even those with overlapping receptive fields. These results show how the knowledge embedded in today’s ANN models might be used to noninvasively set desired internal brain states at neuron-level resolution, and suggest that more accurate ANN models would produce even more accurate control” (p. 1).

516.Yamins and DiCarlo (2016): “within the class of HCNNs [e.g., Hierarchical Convolutional Neural Networks], there appear to be comparatively few qualitatively distinct, efficiently learnable solutions to high-variation object categorization tasks, and perhaps the brain is forced over evolutionary and developmental timescales to pick such a solution” (p. 356).

517.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Konrad Kording: “It’s true that simple models of V1 can describe 30 percent of the variance in V1’s activity. But you can describe half of the variance in the activity of your transistors just by realizing that your computer is turned off at night” (p. 3).

518.See Funke et al. (2020) for some discussion.

519.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Konrad Kording: “There is a traditional view in systems neuroscience that each brain area does something pre-assigned and simple. E.g., V1 detects edges, V4 pulls out colors and curvature, etc. But this view is dying at the moment. It was always suspicious on theoretical grounds. The fact that you know so much, about so many types of things, is in conflict with the view that each specific brain area is simple, as this view does not explain where all of the information available to you comes from. But it’s also empirically wrong. If you look at the literature, when you take a type of signal that matters to animals and looks for it in the brain, you find it everywhere. For example, you can find movement signals and expectations in the primary visual cortex, and rewards explain more of the variance in the primary motor cortex (the “movement area”) than movement. Basically, it’s all a complete mess. … Of course, there’s some specialization. Sound explains more of the variance in auditory cortex than in visual cortex. But the specialization isn’t simple. It’s just easier to publish papers saying e.g. ‘X is the brain area for romantic love,’ than e.g. ‘here are another ten variables X region is tuned to.’” (p. 3). From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Markus Meister: “There is a long history, in neuroscience, of attempting to assign understandable computational roles to little chunks of brain matter (e.g., “the anterior cingulate cortex is for X”). Prof. Meister believes that this program is not going to be very successful, because these regions are massively interconnected, and we now know that if you inject signals into one part of the brain, you find them in many other parts of the brain” (p. 3).

520.Stringer et al. (2018) showed mice pictures from Imagenet (“stimuli”) while the mice also engaged in spontaneous motor behavior (“behavior”): “Stimuli and behavior were represented together in V1 as a mixed representation: there were not separate sets of neurons encoding stimuli and behavioral variables, but each neuron multiplexed a unique combination of sensory and behavioral information” (p. 11).

521.Saleem et al. (2017): “To establish the nature of these signals we recorded in primary visual cortex (V1) and in the CA1 region of the hippocampus while mice traversed a corridor in virtual reality. The corridor contained identical visual landmarks in two positions, so that a purely visual neuron would respond similarly in those positions. Most V1 neurons, however, responded solely or more strongly to the landmarks in one position…. The presence of such navigational signals as early as in a primary sensory area suggests that these signals permeate sensory processing in the cortex” (p. 1).

522.See Cadena et al. (2019), “Dataset and inclusion criteria”: “We recorded a total of 307 neurons in 23 recording sessions…We discarded neurons with a ratio of explainable-to-total variance (see Eq 3) smaller than 0.15, yielding 166 isolated neurons (monkey A: 51, monkey B: 115) recorded in 17 sessions with an average explainable variance of 0.285.”

523.Chong et al. (2016): “Using fMRI and encoding methods, we found that the ‘intermediate’ orientation of an apparently rotating grating, never presented in the retinal input but interpolated during AM [apparent motion], is reconstructed in population-level, feature-selective tuning responses in the region of early visual cortex (V1) that corresponds to the retinotopic location of the AM path” (p. 1453). See also Open Philanthropy’s non-verbatim notes from a conversation with Prof. Won Mok Shim: “There is a traditional view of V1, on which it is the front end of a hierarchical information-processing pipeline, and is responsible for processing simple, low-level features of bottom-up visual input from the retina/LGN. However, many feedback processes and connections have been discovered in V1 over the last decade, and most vision scientists would agree that V1’s information-processing cannot be entirely explained using bottom-up inputs….The anatomy of the visual system also suggests an important role for feedback. For example, there are more feedback connections from V1 to the LGN, than there are feedforward connections from the LGN to V1. V1 receives a large number of connections from other brain areas, like V2, and there are also many lateral connections between cells within V1. The direction of these connections can be identified using neuroanatomical trace studies, mostly from monkeys or cats… On an alternative to the traditional view, V1 is receiving top-down, high-level predictions, which it then compares with the bottom-up input. The difference between the two is an error signal, which is then conveyed from the low-level areas to the high-level areas. The origins of this idea are in computational theory (predictive coding). There is some empirical support as well, but the evidence is not completely clear.” (p. 1-2).

524.See e.g. Schecter et al. (2017), Cooke and Bear (2014), and Cooke et al. (2015).

525.For example, in addition to detecting features of a visual stimulus like the orientation of lines and the spatial frequency of different patterns (features at least somewhat akin to the features detected by the early layers of a ImageNet model), neurons in V1 can also detect the direction that a stimulus is moving, as well as other features of how a stimulus changes over time (see Carandini (2012): “Cells in area V1 are commonly selective for direction of stimulus motion” and “The slant of receptive fields in space-time confers V1 neurons with some selectivity for stimulus speed, but this selectivity depends on the spatial pattern of a stimulus (Movshon et al. (1978a)). Rather than speed, V1 neurons are typically thought to be selective for temporal frequency, which is the inverse of the period between temporal oscillations between dark and light” (in the “Stimulus selectivity” section)). Indeed, visual processing requires a changing stimulus (see Gilbert (2013): “Visual perception requires eye movement. Visual cortex neurons do not respond to an image that is stabilized on the retina because they require moving or flashing stimuli to be activated: they fire in response to transient stimulation” (p. 606)). The images processed by e.g. a ResNet-101, by contrast, are static (though there are computer-vision systems that operate in dynamic environments as well). V1 is also involved in integrating the different visual inputs from different eyes (see Carandini (2012): “The signals from corresponding regions in the two eyes are kept separate in the LGN, and are combined in V1” (in the “Stimulus selectivity” section)), whereas a ResNet receives only one image.

526.From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Adam Marblestone: “Dr. Marblestone does not think it obvious that the visual cortex should be thought of as doing something like object-detection. It could be, for example, making a more complicated transition model based on all of its multi-modal inputs, predicting future inputs and rewards, or doing some kind of iterative inference procedure. We just don’t know quite how high-dimensional or complicated the task the visual system performs is. So any compute estimates based on comparisons between the visual system and current deep neural networks are highly uncertain” (p. 8).

527.From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Kate Storrs: “Returning the name of the main object in an image is a tiny portion of what the visual system can do. Core vision involves understanding the visual world as a navigable 3D space of objects, equipped with orientations, materials, depth, properties, and behavioral affordances. Dr. Storrs would guess that object-recognition only occurs on top of that kind of description of the world. Models analogous to the visual system would need to perform a wider range of the tasks that the visual system performs, which suggests that they would need to be more powerful” (p. 2). From the non-verbatim notes from my conversations with Prof. Konrad Kording: “‘What things are’ isn’t the only question at stake in vision. You want answers to questions like “can I grasp this water bottle? Can I hold it there?”. Indeed, there are a vast number of questions that we want to be able to ask and answer with vision systems, and the “solution” to vision will depend on the exact thing that other parts of the brain need from the visual system. It’s not an easily definable space, and the only way to figure it out is to build a system that fully learns all of the relevant pieces” (p. 4). From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Eric Jonas: “Prof. Jonas is fairly confident that the visual system is not classifying objects into one of k categories” (p. 1).

528.See Serre (2019), section 5.2, for a review.

529.Hendrycks et al. (2020): “We introduce natural adversarial examples–real-world, unmodified, and naturally occurring examples that cause machine learning model performance to substantially degrade. We introduce two new datasets of natural adversarial examples. The first dataset contains 7,500 natural adversarial examples for ImageNet classifiers and serves as a hard ImageNet classifier test set called IMAGENET-A. We also curate an adversarial out-of-distribution detection dataset called IMAGENET-O, which to our knowledge is the first out-of-distribution detection dataset created for ImageNet models. These two datasets provide new ways to measure model robustness and uncertainty. Like ℓp adversarial examples, our natural adversarial examples transfer to unseen black-box models. For example, on IMAGENET-A a DenseNet-121 obtains around 2% accuracy, an accuracy drop of approximately 90%, and its out-of-distribution detection performance on IMAGENET-O is near random chance levels. Popular training techniques for improving robustness have little effect, but some architectural changes provide mild improvements. Future research is required to enable generalization to natural adversarial examples” (p. 1).

530.Elsayed et al. (2018): “Machine learning models are vulnerable to adversarial examples: small changes to images can cause computer vision models to make mistakes such as identifying a school bus as an ostrich. However, it is still an open question whether humans are prone to similar mistakes. Here, we address this question by leveraging recent techniques that transfer adversarial examples from computer vision models with known parameters and architecture to other models with unknown parameters and architecture, and by matching the initial processing of the human visual system. We find that adversarial examples that strongly transfer across computer vision models influence the classifications made by time-limited human observers” (p. 1). A full test of whether humans are comparably vulnerable to adversarial examples, though, might require the ability to access and manipulate the parameters of the human brain in the same manner that one can with an artificial neural network.

531.Barbu et al. (2019): “When tested on ObjectNet, object detectors show a 40-45% drop in performance, with respect to their performance on other benchmarks, due to the controls for biases. Controls make ObjectNet robust to fine-tuning showing only small performance increases” (p. 1).

532.Geirhos et al. (2020) discuss a number of examples. Serre (2019), section 5.2, discusses various generalization failures. See also Recht et al. (2019): “We build new test sets for the CIFAR-10 and ImageNet datasets. Both benchmarks have been the focus of intense research for almost a decade, raising the danger of overfitting to excessively reused test sets. By closely following the original dataset creation processes, we test to what extent current classification models generalize to new data. We evaluate a broad range of models and find accuracy drops of 3% – 15% on CIFAR-10 and 11% – 14% on ImageNet. However, accuracy gains on the original test sets translate to larger gains on the new test sets. Our results suggest that the accuracy drops are not caused by adaptivity, but by the models’ inability to generalize to slightly “harder” images than those found in the original test sets” (p. 1); Lamb et al. (2019): “humans are able to watch cartoons, which are missing many visual details, without being explicitly trained to do so…We propose a dataset that will make it easier to study the detail-invariance problem concretely. We produce a concrete task for this: SketchTransfer, and we show that state-of-the-art domain transfer algorithms still struggle with this task. The state-of-the-art technique which achieves over 95% on MNIST → SVHN transfer only achieves 59% accuracy on the SketchTransfer task, which is much better than random (11% accuracy) but falls short of the 87% accuracy of a classifier trained directly on labeled sketches. This indicates that this task is approachable with today’s best methods but has substantial room for improvement” (p. 1); and Rosenfeld et al. (2018): “We showcase a family of common failures of state-of-the art object detectors. These are obtained by replacing image sub-regions by another sub-image that contains a trained object. We call this ‘object transplanting’. Modifying an image in this manner is shown to have a non-local impact on object detection. Slight changes in object position can affect its identity according to an object detector as well as that of other objects in the image. We provide some analysis and suggest possible reasons for the reported phenomena” (p. 1).

533.Jenkins et al. (2018), for example, found that “people know about 5000 faces on average” (p. 1), and Biederman (1987) estimates that people know 30,000 distinguishable object categories, though he treats this as “liberal” (i.e., on the high end). I have not attempted to evaluate his methodology, but at a glance it looks both loose and based on fairly substantive assumptions. Here is a relevant quote: “How many readily distinguishable objects do people know? How might one arrive at a liberal estimate for this value? One estimate can be obtained from the lexicon. There are less than 1,500 relatively common basic-level object categories, such as chairs and elephants. If we assume that this estimate is too small by a factor of 2, allowing for idiosyncratic categories and errors in the estimate, then we can assume potential classification into approximately 3,000 basic-level categories. RBC assumes that perception is based on a particular componential configuration rather than the basic-level category, so we need to estimate the mean number of readily distinguishable componential configurations per basic-level category. Almost all natural categories, such as elephants or giraffes, have one or only a few instances with differing componential descriptions. Dogs represent a rare exception for natural categories in that they have been bred to have considerable variation in their descriptions. Categories created by people vary in the number of allowable types, but this number often tends to be greater than the natural categories. Cups, typewriters, and lamps have just a few (in the case of cups) to perhaps 15 or more (in the case of lamps) readily discernible exemplars. Let us assume (liberally) that the mean number of types is 10. This would yield an estimate of 30,000 readily discriminable objects (3,000 categories × 10 types/category)” (p. 127). See also Open Philanthropy’s non-verbatim notes from a conversation with Dr. Kate Storrs: “The question of how many categories humans can recognize is sort of impossible, because the concept of a category is fairly fuzzy, and it isn’t rich enough to capture what human visual recognition involves. For example, you’ve probably seen tens of thousands of chairs over the course of your life. You were able to immediately recognize them as chairs, but you were also able to immediately see a large number of individuating properties. Indeed, one of the great powers of the visual system is that it arrives at a description that is flexible enough that you can then carve it up in whatever ways are behaviorally relevant. Looking at common nouns, and budgeting a certain number of instances of each (maybe 100 or 1000) as individually recognizable, might be one way to put a very rough number on the categories that humans can recognize.” (p. 4).

534.Another example might be an image-classification task that involves classifying images into “funny” and “not funny” – a task hardly limited in difficulty by the number of basic objects humans can identify. See Karpathy (2012) for discussion of all of the complex understanding that goes into appreciating a humorous picture: “the point here is that you’ve used a HUGE amount of information in that half second when you look at the picture and laugh. Information about the 3D structure of the scene, confounding visual elements like mirrors, identities of people, affordances and how people interact with objects, physics (how a particular instrument works, leaning and what that does), people, their tendency to be insecure about weight, you’ve reasoned about the situation from the point of view of the person on the scale, what he is aware of, what his intents are and what information is available to him, and you’ve reasoned about people reasoning about people. You’ve also thought about the dynamics of the scene and made guesses about how the situation will unfold in the next few seconds visually, how it will unfold in the thoughts of people involved, and you reasoned about how likely or unlikely it is for people of particular identity/status to carry out some action. Somehow all these things come together to ‘make sense’ of the scene.”

535.Dr. Dario Amodei suggested this consideration. Sarpeshkar (2010) treats the retina as receiving 36 Gb/s and outputting 20 Mb/s (p. 749; he cites Koch et al. (2004)).

536.See here: “224×224×3, a typical size for an image classifier.” See here for some example input sizes.

537.Geirhos et al. (2018): “Here we proposed a fair and psychophysically accurate way of comparing network and human performance on a number of object recognition tasks: measuring categorization accuracy for single-fixation, briefly presented (200 ms) and backward-masked images as a function of colour, contrast, uniform noise, and eidolon-type distortions. We find that DNNs outperform human observers by a significant margin for non-distorted, coloured images—the images the DNNs were specifically trained on… In comparison to human observers, we find the classification performance of three currently well-known DNNs trained on ImageNet—AlexNet, GoogLeNet and VGG-16—to decline rapidly with decreasing signal-to-noise ratio under image degradations like additive noise or eidolon-type distortions” (p. 14-17). See also Figures 2 and 3.

538.From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Kate Storrs: “On the other hand, a lot of our impression of the richness of human vision is illusory. For example, we don’t see crisply, or in color, in the periphery of our visual field. So perhaps biological vision uses its own shortcuts” (p. 2).

539.This is a point suggested by Dr. Dario Amodei. The Cerebras whitepaper suggests that “50 to 98% of your multiplications are wasted” on non-sparse hardware (p. 5).

540.Ravi (2018): “For example, on ImageNet task, Learn2Compress achieves a model 22× smaller than Inception v3 baseline and 4× smaller than MobileNet v1 baseline with just 4.6-7% drop in accuracy. On CIFAR-10, jointly training multiple Learn2Compress models with shared parameters, takes only 10% more time than training a single Learn2Compress large model, but yields 3 compressed models that are upto 94× smaller in size and upto 27× faster with up to 36× lower cost and good prediction quality (90-95% top-1 accuracy).” See also Frankle and Carbin (2018): “Neural network pruning techniques can reduce the parameter counts of trained networks by over 90%, decreasing storage requirements and improving computational performance of inference without compromising accuracy” (p. 1); and Lillicrap and Kording (2019): “From distillation techniques we know that networks trained on ImageNet, a popular 2012 machine learning benchmark that requires the classification of natural images, cannot readily be compressed to fewer than about 100k free parameters [13, 20, 32] (though see [35])” (p. 3). Note also that other models are less efficient than EfficientNet-B2. For example, a ResNet-101 requires ~1e10 FLOPs, and models that both identify and localize objects, that assign the pixels in each image to different objects, or that identify points of interest in a scene, can require more than 1e11 FLOPs per forward pass. See here for examples.

541.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Won Mok Shim: “There is a fair amount of consensus in the field that the human visual system can recognize about ten images per second (e.g., one image per 100 ms). However, this doesn’t mean that it takes 100 ms to recognize an image. For example, you might be able to recognize an image shown very briefly (e.g., for less than 100 ms), but without sequences of other images before and afterwards” (p. 3). Trafton’s (2014) MIT news article notes that previous studies had suggested a rate of 10 images per second. Potter et al. (2013), however, suggest that humans can at least do better than chance on images presented for only 13 ms: “The results of both experiments show that conceptual understanding can be achieved when a novel picture is presented as briefly as 13 ms and masked by other pictures” (p. 275, see also further discussion on p. 276); and Keysers et al. (2001) report that “macaque monkeys were presented with continuous rapid serial visual presentation (RSVP) sequences of unrelated naturalistic images at rates of 14–222 msec/image, while neurons that responded selectively to complex patterns (e.g., faces) were recorded in temporal cortex. Stimulus selectivity was preserved for 65% of these neurons even at surprisingly fast presentation rates (14 msec/image or 72 images/sec). Five human subjects were asked to detect or remember images under equivalent conditions. Their performance in both tasks was above chance at all rates (14–111 msec/image)”. That said, “better than chance” is too low a standard. Potter et al. (2013) also report that “a picture as brief as 20 ms is easy to see if it is followed by a blank visual field (e.g., Thorpe, Fize, and Marlot (1996))” (p. 270).

542.Carandini (2012): “Thanks to high neuronal density and large area, V1 contains a vast number of neurons. In humans, it contains about 140 million neurons per hemisphere (Wandell, 1995), i.e. about 40 V1 neurons per LGN neuron” (from the introduction).

543.For example, one recent estimate by Miller et al. (2014), using better methods, finds 675 million neurons for chimpanzee V1 as a whole. Another – Collins et al. (2016) – finds 737 million neurons in just one chimpanzee V1 hemisphere, suggesting ~1.4 billion in V1 as a whole. The human cortex has ~2× the neurons of the chimpanzee cortex, suggesting something like 1-3 billion for human V1. Mora-Bermúdez et al. (2016): “The human brain is about three times as big as the brain of our closest living relative, the chimpanzee. Moreover, a part of the brain called the cerebral cortex – which plays a key role in memory, attention, awareness and thought – contains twice as many cells in humans as the same region in chimpanzees.”
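A quick sketch of the extrapolation just described (figures from the studies cited in this note; applying the cortex-wide 2× human/chimp ratio to V1 specifically is the note’s assumption):

```python
# Extrapolating human V1 neuron counts from chimpanzee data.
miller_chimp_v1 = 675e6              # Miller et al. (2014): whole chimp V1
collins_chimp_v1 = 737e6 * 2         # Collins et al. (2016): one hemisphere x 2
human_vs_chimp_cortex = 2            # human cortex has ~2x chimp's neurons

low = miller_chimp_v1 * human_vs_chimp_cortex    # ~1.4e9
high = collins_chimp_v1 * human_vs_chimp_cortex  # ~2.9e9
print(f"human V1: ~{low:.1e} to ~{high:.1e} neurons")  # i.e., ~1-3 billion
```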

544.Though Collins et al. (2016) find ~400 million neurons in one hemisphere of chimpanzee V2, suggesting ~800 million for chimp V2 as a whole, and ~1.6 billion for human V2, if we assume a similar human/chimp ratio as for the cortex overall.

545.The high end here is more than half of the neurons in the cortex as a whole (~16 billion neurons, according to Azevedo et al. (2009) (p. 536)), which seems high to me, based on eyeballing pictures of the visual cortex. That said, neuron density in primate visual cortex appears to be unusually high (see Collins et al. (2016): “the packing densities of neurons in V1 were 1.2, 2.1, 3.3, and 3.5 times greater than neuron densities in secondary visual cortex (V2) and somatosensory, motor, and premotor cortices, respectively” (“Visual areas of the cortex”)); numbers in this range do seem to fall out of extrapolation from the chimpanzee data; and ~50% of the cortex is compatible with comments from Prof. Konrad Kording to the effect that roughly half of the brain’s hardware is involved in processing vision in some way. From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Konrad Kording: “The human brain dedicates roughly half of its hardware to processing vision (this can be seen by looking at diagrams created by David Van Essen). And we can solve a lot of the vision problem (e.g., detecting objects, segmenting scenes, storing information) using very modest compute” (p. 1).

546.See my discussion of the cerebellum in Section 2.4.2.3. Though note that neuron densities in V1 are especially high. See Collins et al. (2016): “the packing densities of neurons in V1 were 1.2, 2.1, 3.3, and 3.5 times greater than neuron densities in secondary visual cortex (V2) and somatosensory, motor, and premotor cortices, respectively” (“Visual areas of the cortex”).

547.One could also ask questions like: “how many fewer neurons could this region have/how much less energy could it use, if evolution got to rebuild it from scratch, without needing to do task X, but still needing to do everything else it does?” But these are hard to answer.

548.Drexler (2019) appears to have something like this in mind: “A key concept in the following will be “immediate neural activity” (INA), an informal measure of potentially task-applicable brain activity. As a measure of current neural activity potentially applicable to task performance, INA is to be interpreted in an abstract, information-processing sense that conceptually excludes the formation of long-term memories (as discussed below, human and machine learning are currently organized in fundamentally different ways)” (p. 183-184).

549.My thanks to Dr. Eric Drexler for discussion.

550.Here’s one loose attempt to estimate (1). Following the data in Cadena et al. (2019), suppose that for half of the neurons in V1, ~28% of the variance is explained by the visual stimulus, and ~50% of that can be explained by networks trained on object recognition. To be conservative, let’s assume that none of the variance in the activity of the other half of V1 neurons is explained by visual stimuli at all. This would suggest that at least 7% of variance in V1 neural activity overall can be explained by such models (here I’m following a version of the methodology in Olshausen and Field (2005), who suggest that “If we consider that roughly 40% of the population of neurons in V1 has actually been recorded from and characterized, together with our conjecture that 30% to 40% of the response variance of these neurons can be explained under natural conditions using the currently established models, then we are left to conclude that we can currently account for 12% to 16% of V1 function. Thus, approximately 85% of V1 function has yet to be explained” (p. )). Higher estimates could incorporate all the data listed on http://www.brain-score.org/, which I haven’t tried to interpret, but which appears to suggest a substantial amount of variance explained. From Schrimpf et al. (2018): “The best ANN model V4 predictivity score is 0.663, which is below the internal consistency ceiling of these V4 data (0.892). The best ANN model IT predictivity score is 0.604, which is below the internal consistency ceiling of these IT data (0.817). And the best ANN model behavioral predictivity score is 0.378, which is below the internal consistency ceiling of these behavioral data (0.497)” (p. 7). See also Storrs et al. (2020): “We find that trained models significantly outperform untrained models (accounting for 57% more of the explainable variance), suggesting that features representing natural images are important for explaining hIT. Model fitting further improves the alignment of DNN and hIT representations (by 124%), suggesting that the relative prevalence of different features in hIT does not readily emerge from the particular ImageNet object-recognition task used to train the networks” (abstract).
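The arithmetic behind the 7% figure, as a one-line sketch (the three input fractions are the assumptions stated at the start of this note):

```python
# Half of V1 neurons, ~28% of whose variance is stimulus-driven, of which
# trained networks explain ~50%; the other half is assumed unexplained.
fraction_explained = 0.5 * 0.28 * 0.5
print(f"{fraction_explained:.0%} of V1 variance")  # 7%
```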

551.See e.g. Open Philanthropy’s non-verbatim notes from a conversation with Dr. Kate Storrs: “In Dr. Storrs’ area of neuroscience, there can be a narrative to the effect that: “the early visual system is basically done. We understand the canonical computations: e.g., edge, orientation and color selection. You link them up with local excitation and inhibition, and you have feedback that probably has some kind of predictive function (e.g., you get less and less response from V1 neurons to a predictable stimulus, suggesting that feedback is creating some kind of short-term memory). Once you’ve got all of this, you can explain most of V1 activity.” (This is not necessarily Dr. Storrs’ view; it’s just a summary of a common narrative.)” (p. 3).

552.Open Philanthropy’s technical advisor, Dr. Dario Amodei, suggests that V1 might be a helpful point of focus (ImageNet models plausibly cover functions in other parts of the visual cortex, but he suggests that basing estimates on V1 is conservative).

553.This is a variant on an analogy suggested by Nick Beckstead.

554.For example, FLOPs scaling for bigger inputs appears to be roughly linear: see e.g. here. Dr. Dario Amodei also suggested linear scaling for bigger inputs as a conservative adjustment.

555.Kolesnikov et al. (2020): “All of our BiT models use a vanilla ResNet-v2 architecture [16], except that we replace all Batch Normalization [21] layers with Group Normalization [60] and use Weight Standardization [43] in all convolutional layers. See Section 4.3 for analysis. We train ResNet-152 architectures in all datasets, with every hidden layer widened by a factor of four (ResNet152×4).” A ResNet-152 is ~1e10 FLOPs for a forward pass, and my understanding is that widening every hidden layer by a factor of four results in a ~16× increase in overall FLOPs, suggesting ~2e11 FLOPs.
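A minimal sketch of this scaling logic (the quadratic-in-width step is my gloss: a conv layer’s FLOPs scale with the product of its input and output channel counts, so widening both by 4× multiplies FLOPs by ~16×):

```python
# Estimated forward-pass FLOPs for BiT's widened ResNet-152x4.
resnet152_flops = 1e10            # standard ResNet-152 forward pass
width_factor = 4                  # every hidden layer widened 4x
bit_model_flops = resnet152_flops * width_factor**2
print(f"~{bit_model_flops:.1e} FLOPs")  # ~1.6e11, i.e. ~2e11
```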

556.Tan et al. (2020): “In particular, with single-model and single test-time scale, our EfficientDet-D7 achieves state-of-the-art 53.7 AP with 52M parameters and 325B FLOPs, outperforming previous best detector [44] with 1.5 AP while being 4× smaller and using 13× fewer FLOPs” (p. 2).

557.Others not included in the chart include Kurzweil’s (2012) estimate based on “pattern recognizers”: “emulating one cycle in a single pattern recognizer in the biological brain’s neocortex would require about 3,000 calculations. Most simulations run at a fraction of this estimate. With the brain running at about 10² (100) cycles per second, that comes to 3 × 10⁵ (300,000) calculations per second per pattern recognizer. Using my estimate of 3 × 10⁸ (300 million) pattern recognizers, we get about 10¹⁴ (100 trillion) calculations per second” (p. 195). Kurzweil (2005) also suggests that “Yet another estimate comes from a simulation at the University of Texas that represents the functionality of a cerebellum region containing 10⁴ neurons; this required about 10⁸ cps, or about 10⁴ cps per neuron. Extrapolating this over an estimated 10¹¹ neurons results in a figure of about 10¹⁵ cps for the entire brain” (p. 123).

558.Drexler (2019): “Baidu’s Deep Speech 2 system can approach or exceed human accuracy in recognizing and transcribing spoken English and Mandarin, and would require approximately 1 GFLOP/s per real-time speech stream (Amodei et al. 2015). For this roughly human-level throughput, fPFLOP = 10⁻⁶ [fPFLOP is the fraction of a petaFLOP that a given number of FLOPs represents]. Turning to neural function again, consider that task-relevant auditory/semantic cortex probably comprises >1% of the human brain. If the equivalent of the Deep Speech 2 speech-recognition task were to require 10% of that cortex, then fINA = 10⁻³, and RPFLOP = 1000 [RPFLOP is the ratio of the fraction of the brain’s activity that a task represents, to the fraction of a petaFLOP that the compute to perform that task represents]” (p. 187). Dr. Dario Amodei also suggested an estimate in this vein.

559.Drexler (2019): “Google’s neural machine translation (NMT) systems have reportedly approached human quality (Wu et al. 2016). A multi-lingual version of the Google NMT model (which operates with the same resources) bridges language pairs through a seemingly language-independent representation of sentence meaning (Johnson et al. 2016), suggesting substantial (though unquantifiable) semantic depth in the intermediate processing. Performing translation at a human-like rate of one sentence per second would require approximately 100 GFLOP/s, and f_PFLOP = 10⁻⁴. It is plausible that (to the extent that such things can be distinguished) human beings mobilize as much as 1% of global INA at an “NMT task-level”— involving vocabulary, syntax, and idiom, but not broader understanding— when performing language translation. If so, then for “NMT-equivalent translation,” we can propose f_INA = 10⁻², implying R_PFLOP = 100” (p. 187-188).
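
A compact restatement of Drexler’s ratio, as I read the two quoted passages (my sketch; the names follow his f_PFLOP, f_INA, and R_PFLOP):

```python
PFLOP_PER_SECOND = 1e15

def r_pflop(task_flop_s: float, f_ina: float) -> float:
    """Drexler's R_PFLOP: the brain-activity fraction a task occupies,
    divided by the task's compute as a fraction of a PFLOP/s."""
    f_pflop = task_flop_s / PFLOP_PER_SECOND
    return f_ina / f_pflop

print(r_pflop(1e9, 1e-3))   # Deep Speech 2 example: 1000.0
print(r_pflop(1e11, 1e-2))  # NMT example: 100.0
```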

560.Kurzweil (2005): “Another estimate comes from the work of Lloyd Watts and his colleagues on creating functional simulations of regions of the human auditory system, which I discuss further in chapter 4. One of the functions of the software Watts has developed is a task called “stream separation,” which is used in teleconferencing and other applications to achieve telepresence (the localization of each participant in a remote audio teleconference). To accomplish this, Watts explains, means ‘precisely measuring the time delay between sound sensors that are separated in space and that both receive the sound.’ The process involves pitch analysis, spatial position, and speech cues, including language-specific cues. ‘One of the important cues used by humans for localizing the position of a sound source is the Interaural Time Difference (ITD), that is, the difference in time of arrival of sounds at the two ears.’ Watts’s own group has created functionally equivalent re-creations of these brain regions derived from reverse engineering. He estimates that 1011 cps are required to achieve human-level localization of sounds. The auditory cortex regions responsible for this processing comprise at least 0.1 percent of the brain’s neurons. So we again arrive at a ballpark estimate of around 1014 cps (1011 cps × 103)” (p. 123).

561.Kell et al. (2018): “…we optimized hierarchical neural networks for speech and music recognition. The best-performing network contained separate music and speech pathways following early shared processing, potentially replicating human cortical organization. The network performed both tasks as well as humans and exhibited human-like errors despite not being optimized to do so, suggesting common constraints on network and human performance. The network predicted fMRI voxel responses substantially better than traditional spectrotemporal filter models throughout auditory cortex. It also provided a quantitative signature of cortical representational hierarchy—primary and non-primary responses were best predicted by intermediate and late network layers, respectively. The results suggest that task optimization provides a powerful set of tools for modeling sensory systems” (p. 630).

562.Banino et al. (2018): “Grid cells are thought to provide a multi-scale periodic representation that functions as a metric for coding space [7,8] and is critical for integrating self-motion (path integration) [6,7,9] and planning direct trajectories to goals (vector-based navigation) [7,10,11]. Here we set out to leverage the computational functions of grid cells to develop a deep reinforcement learning agent with mammal-like navigational abilities… Our findings show that emergent grid-like representations furnish agents with a Euclidean spatial metric and associated vector operations, providing a foundation for proficient navigation. As such, our results support neuroscientific theories that see grid cells as critical for vector-based navigation [7,10,11], demonstrating that the latter can be combined with path-based strategies to support navigation in challenging environments” (abstract). Cueva and Wei (2018): “we trained recurrent neural networks (RNNs) to perform navigation tasks in 2D arenas based on velocity inputs. Surprisingly, we find that grid-like spatial response patterns emerge in trained networks, along with units that exhibit other spatial correlates, including border cells and band-like cells. All these different functional types of neurons have been observed experimentally. The order of the emergence of grid-like and border cells is also consistent with observations from developmental studies. Together, our results suggest that grid cells, border cells and others as observed in EC may be a natural solution for representing space efficiently given the predominant recurrent connections in the neural circuits” (p. 1).

563.Merel et al. (2020): “In this work we develop a virtual rodent that learns to flexibly apply a broad motor repertoire, including righting, running, leaping and rearing, to solve multiple tasks in a simulated world. We analyze the artificial neural mechanisms underlying the virtual rodent’s motor capabilities using a neuroethological approach, where we characterize neural activity patterns relative to the rodent’s behavior and goals. We show that the rodent solves tasks by using a shared set of force patterns that are orchestrated into task-specific behaviors over longer timescales. Through methods familiar to neuroscientists, including representational similarity analysis, dimensionality reduction techniques, and targeted perturbations, we show that the networks produce these behaviors using at least two classes of behavioral representations, one that explicitly encodes behavioral kinematics in a task-invariant manner, and a second that encodes task-specific behavioral strategies. Overall, the virtual rat promises to facilitate grounded collaborations between deep reinforcement learning and motor neuroscience” (p. 1).

564.Lloyd (2000): “The amount of information that can be stored by the ultimate laptop, ≈ 10³¹ bits, is much higher than the ≈ 10¹⁰ bits stored on current laptops. This is because conventional laptops use many degrees of freedom to store a bit where the ultimate laptop uses just one. There are considerable advantages to using many degrees of freedom to store information, stability and controllability being perhaps the most important. Indeed, as the above calculation indicates, in order to take full advantage of the memory space available, the ultimate laptop must turn all its matter into energy. A typical state of the ultimate laptop’s memory looks like a plasma at a billion degrees Kelvin: the laptop’s memory looks like a thermonuclear explosion or a little piece of the Big Bang! Clearly, packaging issues alone make it unlikely that this limit can be obtained, even setting aside the difficulties of stability and control” (p. 11).

565.See calculations in Section 4.2.

566.My thanks to Prof. David Wallace for discussion.

567.My thanks to Prof. David Wallace for suggesting this example.

568.From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Paul Christiano: “The algorithmic overhead involved in reversible computing (specifically, the overhead involved in un-computing what you have already computed) is not that bad. Most of the difficulty lies in designing such efficient hardware. Partly for this reason, Dr. Christiano does not think that you can get an upper bound on the FLOP/s required to do what the brain does, purely by appealing to the energy required to erase bits. We believe that you can perform extremely complex computations with almost no bit erasures using good enough hardware” (p. 4). For discussion of some ongoing controversy related to the bit-erasures involved in reading/writing inputs and outputs repeatedly, see Wolpert (2019), Open Philanthropy’s non-verbatim notes from a conversation with Prof. David Wolpert (p. 2), and Open Philanthropy’s non-verbatim notes from a conversation with Dr. Jess Riedel (p. 5).

569.From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Michael Frank (p. 2):

Dr. Frank thinks that it is possible that there are processes in the brain that are close to thermodynamically reversible, and that play a role in computation. We don’t know enough about the brain to answer confidently either way…We don’t have positive evidence that such reversible effects exist and are important to cognition, but we also don’t have positive evidence that rules this out. However, Dr. Frank thinks that it’s a reasonable first-order assumption that those effects, if they exist, would only have a small, second-order effect on the amount of computational work required to simulate the system. If these effects are there, they may be fairly subtle and gradual, acting in a long-term way on the brain, in a manner we are not close to understanding…Overall, Dr. Frank would lean weakly towards the view that you could make a digital model of cognition without including any subtle reversible processes, but because he is not an expert on neural computation, he would not bet confidently one way or another.

From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Stephen Larson (p. 4):

Dr. Larson is not persuaded that Landauer’s limit can be used to upper-bound the FLOP/s necessary to replicate the brain’s task-performance, as it seems possible to him that there could be computational processes occurring in the brain that do not require bit-erasures.

Prof. David Wallace was also skeptical that Landauer’s principle could be used to generate an informative upper bound on required FLOP/s.

570.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Jared Kaplan (p. 2):

Mr. Carlsmith asked Prof. Kaplan’s opinion of the following type of upper bound on the compute required to replicate the brain’s task-performance. According to Landauer’s principle, the brain, given its energy budget (~20 W), can be performing no more than ~1e22 bit-erasures per second. And if the brain is performing less than 1e22 bit-erasures per second, the number of FLOP/s required to replicate its task-performance is unlikely to exceed 1e22. Prof. Kaplan thinks that this type of calculation provides a very reasonable loose upper bound on the computation performed by the brain, and that the actual amount of computation performed by the brain is almost certainly many orders of magnitude below this bound. Indeed, he thinks the true number is so obviously much lower than this that Landauer’s principle does not initially seem particularly germane to questions about brain computation. One analogy might be attempting to upper bound the number of fraudulent votes in a US presidential election via the total population of the world. However, he thinks that upper bounds based on Landauer’s principle are a helpful counter to views on which ‘we really just don’t know’ how much computation the brain performs, or on which doing what the brain does requires the type of compute that would be implicated by very detailed biophysical simulations.

From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Jess Riedel (p. 2-3):

Dr. Riedel is very convinced by the claim that because of Landauer’s principle, the brain can be implementing no more than ~1e22 bit-erasures per second. And he also thinks it very reasonable to infer from this that the brain’s task performance can be replicated using less than 1e22 FLOP/s, conditional on the assumption that the brain’s computation is well-characterized as digital and/or analog computation that can be simulated on a digital computer with modest overhead (he assigns some small probability to this assumption being false, though he would find its falsehood fairly shocking). Indeed, Dr. Riedel expects the amount of computation performed by the brain to be much lower than the upper bound implied by Landauer’s principle. This is partly because, from a basic physics perspective, the vast majority of what’s going on in the brain (e.g., cell maintenance, other thermodynamic processes inside cells) generates entropy but has nothing to do with the computations that are happening.

From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Paul Christiano (p. 5):

Dr. Christiano expects that experts in physics, chemistry, and computer engineering would generally think it extremely unlikely that the brain is erasing less than one bit per computationally useful FLOP it performs. If the brain were doing this, Dr. Christiano believes that this would mean that the brain is qualitatively much more impressive than any other biological machinery we are aware of…Dr. Christiano would be extremely surprised if the brain got more computational mileage out of a single ATP than human engineers can get out of a FLOP, and he would be very willing to bet that it takes at least 10 ATPs to get the equivalent of a FLOP. Mr. Carlsmith estimates that the brain can be using no more than ~1e20 ATPs/second. If this estimate is right, then Dr. Christiano is very confident that you do not need more than 1e20 FLOP/s to replicate the brain’s task-performance.

See also Open Philanthropy’s non-verbatim notes from a conversation with Prof. David Wolpert (p. 3-4) for more discussion, though with less of an obvious upshot:

Mr. Carlsmith asked Prof. Wolpert whether one can use Landauer’s principle to upper bound the FLOP/s required to replicate the human brain’s task-performance… In Prof. Wolpert’s view, it is a subtle and interesting question how to do this type of calculation correctly. A rigorous version would require a large research project… Prof. Wolpert thinks that this calculation is legitimate as a first-pass, back-of-the-envelope upper bound on the bit-erasures that the brain could be implementing. It couldn’t get published in a physics journal, but it might get published in a popular science journal, and it helps get the conversation started. At a minimum, it’s a strong concern that advocates of extreme amounts of computational complexity in the brain (for example, advocates of the view that you need much more than 1e22 FLOP/s to replicate the brain’s computation) would need to address.

571.This deference is not merely the result of tallying up the amount of expert support for different perspectives: it also incorporates more subjective factors involved in my evaluation of the overall evidence provided by the expert opinions I was exposed to.

572.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Konrad Kording: “Examination of neurons reveals that they are actually very non-linear, and the computations involved in plasticity probably include a large number of factors distributed across the cell. In this sense, a neuron might be equivalent to a three-layer neural network, internally trained using backpropagation. In that case, you’d need to add another factor of roughly 10⁵ to your compute estimate, for a total of 10²⁰ multiplications per second. This would be much less manageable. … The difference between the estimates generated by these different approaches is very large – something like ten orders of magnitude. It’s unclear where the brain is on that spectrum” (p. 2). From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Eric Jonas: “Attempting to estimate the compute sufficient to replicate the brain’s task performance is an extremely challenging project. It’s worthwhile (indeed, it’s a common thought experiment amongst neuroscientists), but the error bars will be huge (e.g., something like ten orders of magnitude) … Active dendritic computation could conceivably imply something like 1-5 orders of magnitude more compute than a simple linear summation model of a neuron” (p. 3). If a simple linear summation model implies ~1e13-1e15 FLOP/s – e.g., ~1 FLOP per spike through synapse – this would suggest a range of 1e13-1e20 FLOP/s. From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Erik De Schutter: “Prof. De Schutter thinks that at this point, we simply are not in a position to place any limits on the level of biological detail that might be relevant to replicating the brain’s task-performance” (p. 1). Sandberg and Bostrom (2008) (p. 13) report that in an informal poll of attendees at a conference about the required level of resolution for whole brain emulation, the consensus appeared to be one of the following three levels: “Spiking neural network,” which Sandberg and Bostrom estimate would require 1e18 FLOP/s; “Electrophysiology,” which Sandberg and Bostrom estimate would require 1e22 FLOP/s; and “Metabolome,” which Sandberg and Bostrom estimate would require 1e25 FLOP/s. Henry Markram, in a 2018 video (18:28), estimates the FLOP/s burdens of running a “real-time molecular simulation of the human brain” at 4e29 FLOP/s (and see here for some arguments in which he seems to suggest that levels of detail in this vein are central to counting as a simulation of the brain); and Bell (1999) appears to suggest that we cannot be confident that even a molecular level simulation of the brain would be adequate (p. 2018).

573.I’ve mostly relied on Frank (2018), Sagawa (2014), Wolpert (2019), and Wolpert (2019a) for my understanding of the principle, together (centrally) with discussion with experts. Feynman (1996), Chapter 5, also contains a fairly accessible introduction. See Landauer (1961) for the original statement of the argument: “It is argued that computing machines inevitably involve devices which perform logical functions that do not have a single-valued inverse. This logical irreversibility is associated with physical irreversibility and requires a minimal heat generation, per machine cycle, typically of the order of kT for each irreversible function. This dissipation serves the purpose of standardizing signals and making them independent of their exact logical history” (p. 183).

574.Here I am following Frank (2018): “Let there be a countable (usually finite) set C = {cᵢ} of distinct entities cᵢ called computational states. Then a general definition of a (possibly stochastic) (computational) operation O is a function O : C → P(C), where P(C) denotes the set of probability distributions over C. That is, O(cᵢ) for any given cᵢ ∈ C is some corresponding probability distribution Pᵢ : C → [0, 1]. The intent of this definition is that, when applied to an initial computational state cᵢ, the computational operation transforms it into a final computational state cⱼ, but in general, this process could be stochastic, meaning that, for whatever reason, having complete knowledge of the initial state does not imply having complete knowledge of the final state” (p. 11). See Maroney (2005) for more discussion of stochastic computation in the context of Landauer’s principle.

575.Schroeder (2000): “Entropy is just the logarithm of the number of ways of arranging things in the system (times Boltzmann’s constant)” (p. 75). See also Wikipedia on Boltzmann’s principle. From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Jared Kaplan: “Landauer’s principle states that erasing a bit of information requires a minimum energy expenditure – specifically, kT ln2, where k is Boltzmann’s constant, and T is the absolute temperature. This principle is grounded in the relationship between entropy and energy – the same relationship that grounds the fact that heat doesn’t flow from cold things to hot things, and the fact that you can’t create a perpetual motion machine or an arbitrarily efficient engine. For physicists, entropy is the logarithm of the number of accessible states. When a system changes, either this entropy stays the same, or it increases…” (p. 1).

576.I am using the term “logical bit-erasures” to quantify logical entropy drops of the kind to which Landauer’s principle, as I understand it, is relevant, even in a stochastic context. Discussions of Landauer’s principle sometimes assume a deterministic context, in which the relationship between decreases in logical entropy and logical irreversibility (e.g., the inability to reconstruct inputs on the basis of outputs) is more straightforward (e.g., logically irreversible operations necessarily decrease logical entropy). Stochastic contexts introduce more complexities (see e.g. Frank (2018) and Maroney (2018) for some discussion), but as I understand it, the basic fact that decreasing logical entropy implicates Landauer costs remains unaltered. See also Kempes et al. (2017), who use a similar way of measuring Landauer costs in articulating what they call the “generalized Landauer bound” (p. 7), e.g.: “to focus on the specifically computation-based thermodynamic cost of a process, suppose that at any given time t all states x have the same energy. It is now known that in this situation the minimal work required to transform a distribution P₀(x) at time 0 to a distribution P₁(x) at time 1 is exactly kT[S(P₀) − S(P₁)] where S(.) is Shannon entropy and x lives in a countable space X” (p. 6). See also Open Philanthropy’s non-verbatim notes from a conversation with Prof. David Wolpert: “The generalized Landauer bound tells you the energy costs of performing a computation in a thermodynamically reversible way – energy that you could in principle get back. In particular: if you’re connected to a single heat bath, then regardless of whether your computation is deterministic or noisy, the generalized Landauer’s bound says that the minimum free energy you need to expend (assuming you perform the computation in a thermodynamically reversible way) is kT multiplied by the drop in the entropy. The total energy costs of a computation will then be the Landauer cost, plus the extra energy dissipated via the thermodynamically irreversible aspects of the physical process. This extra energy cannot be recovered” (p. 2).

577.My (non-expert) understanding is that one way to loosely and informally express the basic idea here (without attempting to actually justify it technically) is as follows. Because the computer and the environment are assumed to be independent (at least with respect to the types of correlations we will realistically be able to keep track of), total entropy (call this S_tot) is simply the entropy of the computer (S_comp) plus the entropy of the environment (S_env). And because the logical states are simply sets of computer microstates, the entropy of the computer is just the logical entropy (S_log), plus the entropy of the computer conditioned on the logical state (call this S_comp|log). So S_tot = S_log + S_comp|log + S_env. This means that according to the second law, if S_log goes down, then S_comp|log and/or S_env have to go up by an amount sufficient to render the total change in entropy non-negative (see Sagawa (2014) (p. 15-17) for a more formal description of this basic framework; see also Frank (2018), section 3.2, and especially p. 19, as well as his verbal description in this lecture (21:44)). And because the brain is a finite system with a finite capacity to absorb entropy, increasing S_comp|log can only go so far if the computer is continuously processing. Eventually, if S_log goes down, S_env must go up by a corresponding amount (see Open Philanthropy’s non-verbatim notes from a conversation with Dr. Jess Riedel: “A system like a brain or a computer contains non-information-bearing degrees of freedom that can absorb a finite amount of entropy. However, because the brain/computer is continuously processing and using energy, you can’t keep dumping entropy into those degrees of freedom indefinitely. Eventually, you need to start pushing entropy into the environment. If we assume that the states of the computer and the environment are not correlated (or at least, not in a way that we can realistically keep track of), then the total entropy will be the entropy of the computer plus the entropy of the environment. If the entropy of the computer goes down, the entropy of the environment must go up” (p. 2)).
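
In display form (my gloss of the decomposition above, using the note’s own symbols):

```latex
S_{\mathrm{tot}} = S_{\mathrm{log}} + S_{\mathrm{comp}\,|\,\mathrm{log}} + S_{\mathrm{env}},
\qquad
\Delta S_{\mathrm{tot}} \geq 0
\;\Rightarrow\;
\Delta S_{\mathrm{comp}\,|\,\mathrm{log}} + \Delta S_{\mathrm{env}} \geq -\Delta S_{\mathrm{log}}.
```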

578.From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Jess Riedel: “In certain rare environments, you can decrease entropy by paying costs in conserved quantities other than energy (for example, you can pay costs in angular momentum). But this is not relevant in the context of the brain.” See Vaccaro and Barnett (2011) for more discussion.

579.From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Jess Riedel: “Landauer’s principle follows almost trivially from basic principles of thermodynamics. Indeed, it can be understood simply as a rewriting of the definition of temperature. At a fundamental level, temperature is defined via the change in energy per unit change in entropy (up to a proportionality constant, Boltzmann’s constant). The practical and folk definitions of temperature, which focus on the amount of energy in a system (e.g., the kinetic energy of vibrating atoms), can be recovered from this more fundamental definition in all but a small number of exceptional cases. As the energy in a non-exceptional system increases, the number of states it can be in (and hence its maximum possible entropy) increases as well. If you have a system with a certain amount of energy, and you want to decrease its entropy, you need to put that entropy somewhere else, because total entropy is non-decreasing. Temperature gives us the exchange rate between energy and entropy. If you want to put some unit of entropy into a heat bath, you have to pay an energy cost, and the temperature of the bath is that cost” (p. 2). See also Open Philanthropy’s non-verbatim notes from a conversation with Prof. Jared Kaplan: “Almost all fixed systems have more accessible states as the energy goes up. Temperature just is how the energy changes as the entropy changes (textbooks will often state this as: the reciprocal of the temperature is the derivative of the entropy with respect to the energy). As an intuitive example: if your system (e.g., a set of gas molecules) has no energy at all, then all your molecules are just lying on the floor. As you add energy, they can bounce around, and there are many more configurations they can be in. The energy of a single moving particle is another example. Its kinetic energy is ½ × mass × velocity². The velocity is a vector, which in a three-dimensional space will live on some sphere. As you make the energy bigger, the surface area of this sphere increases. This corresponds to a larger number of accessible states (at the quantum mechanical level, these states are discrete, so you can literally count them)” (p. 1-2).

580.Schroeder (2000): “The temperature of a system is the reciprocal of the slope of its entropy vs. energy graph. The partial derivative is to be taken with the system’s volume and number of particles held fixed; more explicitly: 1/T = (∂S/∂U)_{N,V} (3.5). From now on I will take equation 3.5 to be the definition of temperature. You may be wondering why I do not turn the derivative upside down, and write equation 3.5 as T = (∂U/∂S)_{N,V} (3.6). The answer is that there is nothing wrong with this, but it’s less convenient in practice, because rarely do you ever have a formula for energy in terms of entropy” (p. 88). See also Jared Kaplan’s notes on Statistical Mechanics & Thermodynamics, p. 24; Wikipedia, “Definition of thermodynamic temperature”; and the quotes in the previous endnote.
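
Chaining the two quoted definitions yields the Landauer cost directly (my one-line gloss, not a quotation): erasing a bit pushes at least ΔS = k ln 2 of entropy into the environment, and temperature is the exchange rate between energy and entropy, so

```latex
\Delta U \;\geq\; T\,\Delta S \;=\; kT\ln 2 \;\approx\; 3 \times 10^{-21}\,\mathrm{J}
\quad\text{at } T = 310\,\mathrm{K}.
```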

581.See Bennett (2003), section 2 (“Objections to Landauer’s principle”), for a description of the various objections, together with his replies (p. 502-508). Some aspects of the controversy, such as whether Landauer’s principle can exorcise Maxwell’s Demon without first assuming the second law (see e.g. Earman and Norton (1998) and Norton (2004)), are not relevant for our purposes, as assuming the truth of the second law is not a dialectical problem in this context.

The objection that logical irreversibility does not imply thermodynamic irreversibility (see e.g. Maroney (2018)) might seem to have more force, as Landauer’s principle is indeed often understood as claiming or implying the contrary (see Maroney (2018) for description of these interpretations; see also Bub (2002) (p. 10):

a logically irreversible operation must be implemented by a physically irreversible device, which dissipates heat into the environment.

My own impression, from Open Philanthropy’s non-verbatim notes from a conversation with Prof. David Wolpert and from Sagawa (2014), is that this objection, applied to interpretations of Landauer’s principle inconsistent with it, is in fact correct, but that it does not alter the fact that bit-erasure requires transferring energy to the environment – it merely notes that such a transfer can, in principle, be performed in a thermodynamically reversible way. See e.g. Kempes et al. (2017) (p. 6-7); Wolpert (2019a) (p. 3); Sagawa (2014) (p. 12):

The logically irreversible erasure can be performed in a thermodynamically reversible manner in the quasi-static limit.

See also Open Philanthropy’s non-verbatim notes from a conversation with Prof. David Wolpert (p. 2). Maroney (2018), after arguing that “logical reversibility neither implies, nor is implied by, thermodynamic reversibility” (p. 1), nevertheless acknowledges on page 14 that:

This does not contradict Landauer (1961) in the least. All that Landauer can be said to have shown was that a resetting operation required a generation of heat in the environment. However, a confusion then appears to arise through the incorrect use of the term ‘dissipation’. In Landauer (1961) and in much of the surrounding literature ‘dissipation’ is used more or less interchangeably with ‘heat generation’. Strictly, dissipation should be used only when the conversion of work to heat arises through dissipative forces (such as those involving friction) which are thermodynamically irreversible. Forces which are thermodynamically reversible are non-dissipative.

That said, I have not attempted to evaluate this debate in detail, and I try, in the section, to remain neutral about it where possible (for example, I try to avoid the suggestion that bit erasure requires dissipating energy, as opposed to simply transferring it, though I don’t think I will have entirely avoided controversy: see e.g. Frank (2018) (p. 1), who argues that:

Landauer’s Principle is not about general entropy transfers; rather, it more specifically concerns the ejection of (all or part of) some correlated information from a controlled, digital form (e.g., a computed bit) to an uncontrolled, non-computational form, i.e., as part of a thermal environment.

I’m aware of at least one empirical result that presents itself as in tension with some versions of Landauer’s principle: López-Suárez et al. (2016) (though Kish (2016) (p. 1) suggests that their argument:

neglects the dominant source of energy dissipation, namely, the charging energy of the capacitance of the input electrode, which totally dissipates during the full (0-1-0) cycle of logic values.

López-Suárez et al. (2016) (p. 3) also note that:

We stress here that our experiment does not question the so-called Landauer-reset interpretation, where a net decrease of physical entropy requires a minimum energy expenditure. What we have here is a logically irreversible computation, that is a generic process where a decrease in the amount of information between the output and the input is realized with an arbitrarily small energy dissipation; this shows that logical reversibility and physical reversibility have to be treated on independent bases.

Frank (2018) (p. 36-37) claims that:

the only experiments that have claimed to demonstrate violations of Landauer’s limit have been ones in which the experimenters misunderstood some basic aspect of the Principle, such as the need to properly generalize the definition of logical reversibility, which was the subject of [11, 12, 13], or the role of correlations that we explained in §3.3 above.

However, he does not give more details, in his 2018 paper, as to the experiments he has in mind or the misunderstandings he takes to be involved.

582.Wolpert (2019a): “This early work [by Landauer and Bennett] was grounded in the tools of equilibrium statistical physics. However, computers are highly nonequilibrium systems. As a result, this early work was necessarily semiformal, and there were many questions it could not address. On the other hand, in the last few decades there have been major breakthroughs in non-equilibrium statistical physics. Some of the most important of these breakthroughs now allow us to analyze the thermodynamic behavior of any system that can be modeled with a time-inhomogeneous continuous-time Markov chain (CTMC), even if it is open, arbitrarily far from equilibrium, and undergoing arbitrary external driving. In particular, we can now decompose the time-derivative of the (Shannon) entropy of such a system into an ‘entropy production rate’, quantifying the rate of change of the total entropy of the system and its environment, minus an ‘entropy flow rate’, quantifying the rate of entropy exiting the system into its environment. Crucially, the entropy production rate is non-negative, regardless of the CTMC. So if it ever adds a nonzero amount to system entropy, its subsequent evolution cannot undo that increase in entropy. (For this reason it is sometimes referred to as irreversible entropy production.) This is the modern understanding of the second law of thermodynamics, for systems undergoing Markovian dynamics. In contrast to entropy production, entropy flow can be negative or positive. So even if entropy flow increases system entropy during one time interval (i.e. entropy flows into the system), often its subsequent evolution can undo that increase” (see p. 2-3).

583.Prof. David Wallace indicated that most physicists accept Landauer’s principle. Though see Open Philanthropy’s non-verbatim notes from a conversation with Dr. Jess Riedel: “Landauer’s principle follows almost trivially from basic principles of thermodynamics… There is some dispute over Landauer’s limit in the literature. Whether the basic assumptions it follows from apply in the real world is somewhat subtle” (p. 2).

584.See the review in Frank (2018): “In 2012, Berut et al. tested Landauer’s Principle in the context of a colloidal particle trapped in a modulated double-well potential, an experimental setup designed to mimic the conceptual picture that we reviewed in Fig. 12. Their experimental results showed that the heat dissipated in the erasure operation indeed approached the Landauer value of kT ln 2 in the adiabatic limit. Also in 2012, Orlov et al. tested Landauer’s Principle in the context of an adiabatic charge transfer across a resistor, and verified that, in cases where the charge transfer is carried out in a way that does not erase known computational information, the energy dissipated can be much less than kT ln 2, which validates the theoretical rationale for doing reversible computing. In 2014, Jun et al. [7] carried out an even more high-precision version of the Berut experiment, verifying again the Landauer limit, and that similar, logically-reversible operations can, in contrast, be done in a way that approaches thermodynamic reversibility. Finally, in 2018, Yan et al. [8] carried out a quantum-mechanical experiment demonstrating that Landauer’s Principle holds at the single-atom level” (p. 36-37).

585.Aiello (1997): “On the basis of in vivo determinations, the mass-specific metabolic rate of the brain is approximately 11.2 W/kg (watts per kilogram). This is over 22 times the mass-specific metabolic rate of skeletal muscle (0.4 W/kg) (Aschoff et al. (1971)). A large brain would, therefore, be a considerable energetic investment. For example, an average human has a brain that is about 1 kg larger than would be expected for an average mammal of our body size (65 kg) and the metabolic cost of this brain would be just under 5 times that of the brain of the average mammal (humans = 14.6 watts, average mammal = 3.0 watts) (Aiello and Wheeler (1995))” (see the section “The expensive brain”). Aiello and Wheeler (1995) contains the same estimate, citing Aschoff et al. (1971), which I have not attempted to access (and which appears to be in German). Sarpeshkar (1997): “The global power consumption of the brain has been measured numerous times by the Kety-Schmidt technique, and the measurements have generally been fairly consistent, even over 40 years. A recent measurement [38] yielded an oxygen uptake of 144 µmol·100 g⁻¹·min⁻¹. The glucose reaction yields, in in-vitro reactions, about 60 kJ/mol × 38 ATP/6 = 380 kJ/mol of oxygen consumed. The 60 kJ/mol value was obtained from [29]. The weight of the brain is about 1.3 kg [10]. Thus, the power consumption in watts is computed to be 11.8 W, a value that we shall round off to 12 W” (p. 204, though in Sarpeshkar (2010) (p. 748), he uses the Aiello (1997) estimate above). Jabr (2012a), writing for Scientific American, estimates 12.6 W. Merkle (1989) cites Kandel et al. (1985) (though without a page number) for a 25 W estimate, though he assumes that only 10 W is actually used for computation. Watts et al. (2018) write that “While making up only a small fraction of our total body mass, the brain represents the largest source of energy consumption—accounting for over 20% of total oxygen metabolism,” which would suggest ~16 W if we used the ~80 W estimate for the whole body cited in Aiello (1997). Various citations listed here say that 20% of body energy consumption goes to the brain, which the website’s author uses to generate an estimate of 20 W for the brain, based on 100 W consumption by the human body as a whole. My impression is that the 20% number is used in numerous other contexts (see e.g. Engl and Attwell (2015), who cite Kety (1957); Sokoloff (1960), and Rolfe and Brown (1997) – though I haven’t followed up on these citations).
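
Sarpeshkar’s figure can be rechecked from the numbers he quotes (my arithmetic, not his):

```python
# Recomputing Sarpeshkar (1997)'s ~12 W brain power figure.
o2_uptake = 144            # umol O2 per 100 g of brain per minute
energy_per_mol_o2 = 380e3  # J per mol of O2 consumed (his in-vitro value)
brain_mass_kg = 1.3

mol_per_kg_per_s = o2_uptake * 1e-6 / 0.1 / 60   # unit conversion
watts = mol_per_kg_per_s * energy_per_mol_o2 * brain_mass_kg
print(round(watts, 1))     # ~11.9 W (he computes 11.8 W and rounds to 12 W)
```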

586.Engl and Attwell (2015): “Current theoretical estimates and experimental data assessing the contribution of each ‘housekeeping’ process to the brain’s total energy budget are inconclusive for many processes, varying widely in some cases. Further research is needed to fill these gaps, and the 40% value shown (right), for the whole brain according to Astrup et al. (1981a), as opposed to the 25% assumed for grey matter in Fig. 1, is quite uncertain” (p. 3424, Figure 5).

587.See Howarth et al. (2012): “As panel A, but including non-signaling energy use, assumed to be 6.81 × 10²² ATP/s/m³, that is, 1/3 of the neuronal signaling energy, so that housekeeping tasks are assumed to account for 25% of the total energy use. On this basis, resting potentials use 15%, action potentials 16%, and synaptic processes 44% of the total energy use” (p. 1224, Figure 1).

588.See Engl and Attwell (2015) for some description of these tasks: “Perhaps surprisingly, a significant fraction of brain energy use (25–50%) in previous energy budgets has been assigned to non-signalling (so-called ‘housekeeping’) tasks, which include protein and lipid synthesis, proton leak across the mitochondrial membrane, and cytoskeletal rearrangements, the rate of ATP consumption on all of which is poorly understood” (p. 3418), though Engl and Attwell emphasize that the methodology used to generate these estimates is quite uncertain.

589.See Figure 1.

590.Wang et al. (2014): “On average, deep brain temperature is less than 1°C higher than body temperature in humans, unless cerebral injury is severe enough to significantly disrupt the brain-body temperature regulation (Soukup et al., 2002)” (p. 6). Thanks to Asya Bergal for this citation. See also Nelson and Nunneley (1998): “Cerebral temperatures were generally insensitive to surface conditions (air temperature and evaporation rate), which affected only the most superficial level of the cerebrum” (abstract). Human body temperature is about 37 °C (310 kelvin).

591.From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Jess Riedel: “The temperature relevant to applying Landauer’s limit to the brain is essentially that of the skull and blood. Even if the temperature outside the body is at a lower temperature, the brain will have to push entropy into its environment via those conduits. If there were some other cold reservoir inside the brain absorbing entropy (there isn’t), it would quickly be expended” (p. 3). Sandberg (2016), in his attempt to apply Landauer’s limit to the brain, uses body temperature as well (see p. 5).

592.See calculation here.

593.See calculation here. Sandberg’s (2016) estimate is slightly higher: “20 W divided by 1.3 × 10⁻²¹ J (the Landauer limit at body temperature) suggests a limit of no more than 1.6 × 10²² irreversible operations per second” (p. 5). This is because his estimate of the Landauer limit at body temperature differs from mine by about a factor of two – I’m not sure why.
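
Since the linked calculation isn’t reproduced inline, here is my recomputation, which lands near the ~1e22/s figure used in the main text (plugging in Sandberg’s 1.3 × 10⁻²¹ J instead recovers his 1.6 × 10²²):

```python
import math

k = 1.380649e-23   # Boltzmann's constant, J/K
T = 310            # ~body temperature, kelvin

landauer_joules_per_bit = k * T * math.log(2)   # ~3.0e-21 J per bit erased
brain_watts = 20
print(f"{brain_watts / landauer_joules_per_bit:.1e}")  # ~6.7e+21 erasures/s
```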

594.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. David Wolpert: “In Prof. Wolpert’s view, it is a subtle and interesting question how to do this type of calculation correctly. A rigorous version would require a large research project. One complexity is that the brain is an open system, in what would be formally called a non-equilibrium steady state, which continually receives new inputs and performs many computations at the same time, even though its entropy does not change that much over time. Landauer’s principle, though, applies to drops in entropy that occur in each step of a calculation. Various other caveats would also be necessary. For example, there are long-range correlations between bits, and there are multiple heat baths in the brain. As a simplified toy model, however, we can imagine that the brain computes in a serial fashion. It gets new inputs for each computation (thereby reinflating the entropy), and each computation causes a drop in entropy. In this case, the upper bound on bit-erasures suggested by Mr. Carlsmith would apply. Prof. Wolpert thinks that this calculation is legitimate as a first-pass, back-of-the-envelope upper bound on the bit-erasures that the brain could be implementing. It couldn’t get published in a physics journal, but it might get published in a popular science journal, and it helps get the conversation started” (p. 3). I expect that further investigation would reveal other complexities as well.

595.Jared Kaplan’s notes on Statistical Mechanics & Thermodynamics: “Say we add two numbers, eg 58 + 23 = 81. We started out with information representing both 58 and 23. Typically this would be stored as an integer, and for example a 16 bit integer has information, or entropy, 16 log 2. But at the end of the computation, we don’t remember what we started with, rather we just know the answer. Thus we have created an entropy S = 2 × (16 log 2) − (16 log 2) = 16 log 2 through the process of erasure!” (p. 59). See also Hänninen and Takala (2010): “The binary addition operation performs an unbalanced compression between the input and output state spaces, since the mapping between the values is not bijective. Medium-sized result values can originate from the largest set of possible input operand pairs. The addition of two n-bit binary operands results in at most an (n + 1)-bit result, and the result value 2ⁿ − 1 compresses the largest group of input pairs, 2ⁿ distinct cases, into the single output. Thus, the logical reversal of the addition requires the result word and n extra bits, which could be chosen simply to represent one of the input operands. The number of bits required to reverse the binary addition, as one indivisible logical operation, can be interpreted as the minimum amount of information lost in any irreversible adder structure at best. This loss determines the minimum achievable energy cost per operation” (p. 224). See also Hänninen and Takala (2010) (p. 2370) for comparable discussion re: multiplication. Hänninen et al. (2011) discuss the possibility of less-than-n bit erasures for word-length n operations in the context of “non-trivial multiplication,” which, at a glance, seems to involve excluding multiplications that take zero as an operand (see p. 2371).

596.Hänninen et al. (2011) estimate the bit-erasures implicated by various proposed multiplier implementations. The array multiplier is the most efficient, at 8n² for n-bit words (see Table II, p. 2372). 8 × 4² = 128; 8 × 8² = 512.
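
The “unbalanced compression” claim in the previous note is easy to verify by brute force (my sketch; the last line just evaluates Hänninen et al.’s 8n² multiplier count):

```python
from collections import Counter

n = 4  # word length; small enough to enumerate every operand pair
preimages = Counter(x + y for x in range(2**n) for y in range(2**n))

# The sum 2^n - 1 has the most preimages: all 2^n pairs (x, 2^n - 1 - x),
# so reversing the adder needs the (n+1)-bit result plus n extra bits.
assert preimages[2**n - 1] == 2**n
assert max(preimages.values()) == 2**n

# Hänninen et al. (2011)'s array-multiplier figure, 8n^2, for n = 4 and n = 8:
print(8 * 4**2, 8 * 8**2)  # 128 512
```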

597.Sarpeshkar (1998) discusses more efficient, analog implementations: “Items 1 through 3 show that analog computation can be far more efficient than digital computation because of analog computation’s repertoire of rich primitives. For example, addition of two parallel 8-bit numbers takes one wire in analog circuits (using Kirchoff’s current law), whereas it takes about 240 transistors in static CMOS digital circuits. The latter number is for a cascade of 8 full adders. Similarly an 8-bit multiplication of two currents in analog computation takes 4 to 8 transistors, whereas a parallel 8-bit multiply in digital computation takes approximately 3000 transistors” (p. 1605).

598.See also Hänninen et al. (2011): “Present CMOS effectively performs an erasure every time a transistor switches states—generating hugely unnecessary levels of heat” (p. 2370).

599.Sarpeshkar (1998): “an 8-bit multiplication of two currents in analog computation takes 4 to 8 transistors, whereas a parallel 8-bit multiply in digital computation takes approximately 3000 transistors” (p. 1605).

600.Asadi and Navi (2007): “Table 3: comparison between 32 × 32 bit multipliers … Transistor counts: 21579.00, 25258.00, 32369.00” (Table 3, p. 346).

601.Given the probability distribution over inputs to which the brain is in fact exposed, that is.

602.My thanks to Prof. David Wallace for discussion.

603.From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Jess Riedel: “There is a simple algorithm for converting a computation that uses logically irreversible operations into an equivalent computation that uses logically reversible operations. This allows you to avoid almost all of the relevant logical bit-erasures” (p. 4). And from Open Philanthropy’s non-verbatim notes from a conversation with Dr. Paul Christiano: “We believe that you can perform extremely complex computations with almost no bit erasures using good enough hardware” (p. 4). See also Bennett (1989): “Reversible computers of various kinds (Turing machines, cellular automata, combinational logic) have been considered [1], [11], [12], [13], [6], [2], [14] especially in connection with the physical question of the thermodynamic cost of computation; and it has been known for some time that they can simulate the corresponding species of irreversible computers in linear time [1] (or linear circuit complexity [13]), provided they are allowed to leave behind at the end of the computation a copy of the input (thereby rendering the mapping between initial and final states 1:1 even though the input-output mapping may be many-to-one)” (p. 766). See also Sagawa (2014) (p. 8 in the arxiv version), and Bennett (1973). For disagreement/controversy, see Wolpert (2019a): “Summarizing, it is not clear that there is a way to implement a logically irreversible function with an extended circuit built out of logically reversible gates that reduces the Landauer cost below the Landauer cost of an equivalent AO [“all at once”] device. The effect on the mismatch cost of using such a circuit rather than an AO device is more nuanced, varying with the priors, the actual distribution, etc.” (p. 33 of the arxiv paper). My understanding is that the crux of this objection hinges on the fact that the reversible circuit will need to be reused, which means that its inputs and outputs will need to be reinitialized: “In general, the Landauer cost and mismatch cost of answer-reinitialization of an extended circuit will be greater than the corresponding answer-reinitialization costs of an equivalent AO device. This is for the simple reason that the answer-reinitialization of the extended circuit must reinitialize the bits containing copies of x and m, which do not even exist in the AO device” (p. 30 of the arxiv paper). See also Open Philanthropy’s non-verbatim notes from a conversation with Prof. David Wolpert (p. 2). Dr. Jess Riedel was skeptical of this sort of objection. From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Jess Riedel: “Dr. Riedel is skeptical of objections to the viability of reversible computing that appeal to the bit-erasures involved in receiving new inputs and writing new final outputs. It’s true that reversible computing paradigms require bit-erasures for this, but for most interesting computations, the intermediate memory usage is much (often exponentially) larger than the input and output data” (p. 5). I have not attempted to evaluate this debate in detail. If Prof. Wolpert is correct, then algorithmic arguments look stronger.

604.Sagawa (2014): “A computational process C is logically reversible if and only if it is an injection. In other words, C is logically reversible if and only if, for any output logical state, there is a unique input logical state. Otherwise, C is logically irreversible” (p. 7 in the arxiv version).

605.Hänninen and Takala (2010): “the logical reversal of the addition requires the result word and n extra bits, which could be chosen simply to represent one of the input operands” (p. 224). And see also Jared Kaplan’s notes on Statistical Mechanics & Thermodynamics: “In principle we can do even better through reversible computation. After all, there’s no reason to make erasures. For example, when adding we could perform an operation mapping (x, y) → (x, x + y), for example (58, 23) → (58, 81), so that no information is erased. In this case, we could in principle perform any computation we like without producing any waste heat at all. But we need to keep all of the input information around to avoid creating entropy and using up energy” (p. 60).
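
Kaplan’s example as executable code (my sketch): the map (x, y) → (x, x + y) is a bijection, so the inputs remain recoverable and nothing need be erased.

```python
def add_keep_inputs(x: int, y: int) -> tuple[int, int]:
    """Logically reversible addition: keep x alongside the sum."""
    return (x, x + y)

def uncompute(x: int, s: int) -> tuple[int, int]:
    """Invert the map exactly; no information was destroyed."""
    return (x, s - x)

assert add_keep_inputs(58, 23) == (58, 81)
assert uncompute(58, 81) == (58, 23)
```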

606.Johnson (1999): “Efficient as such a system would be, there would still be drawbacks. In a complex calculation, the extra memory needed to save all the intermediary “garbage bits” can grow wildly. As a compromise, Dr. Bennett devised a memory-saving method in which a computer would carry out a few steps of the calculation, copy the result and rewind. Then, starting with the copied result, it would take a few more steps. He likened the method to crossing a river using just a few stepping stones: one must backtrack to pick up the stones left behind, placing them in the path ahead. While the procedure would consume less memory, it would require more computational steps, slowing down the calculation. To computer scientists, this was a classic tradeoff: pay the computational cost with either memory space or processing time.” Wolpert (2019b): “One of the properties of logically reversible gates that initially caused problems in designing circuits out of them is that running those gates typically produces “garbage” bits, to go with the bits that provide the output of the conventional gate that they emulate. The problem is that these garbage bits need to be reinitialized after the gate is used, so that the gate can be used again. Recognizing this problem, [50] shows how to avoid the costs of reinitializing any garbage bits produced by using a reversible gate in a reversible circuit C′, by extending C′ with yet more reversible gates (e.g., Fredkin gates). The result is an extended circuit that takes as input a binary string of input data x_IN, along with a binary string of “control signals” m ∈ M, whose role is to control the operation of the reversible gates in the circuit. The output of the extended circuit is a binary string of the desired output for input x_IN, x_OUT = f(x_IN), together with a copy of m, and a copy of x_IN, which I will write as x_IN^copy. So in particular, none of the output garbage bits produced by the individual gates in the original, unextended circuit of reversible gates still exists by the time we get to the output bits of the extended circuit. While it removes the problem of erasing the garbage bits, this extension of the original circuit with more gates does not come for free. In general it requires doubling the total number of gates (i.e., the circuit’s size), doubling the running time of the circuit (i.e., the circuit’s depth), and increasing the number of edges coming out of each gate, by up to a factor of 3. (In special cases though, these extra costs can be reduced, sometimes substantially.)” (p. 28). See also Michael Frank’s comments here: “It is probably the case that general reversible computations do require some amount of overhead in either space or time complexity; indeed, Ammer and I proved rigorously that this is true in a certain limited technical context. But, the overheads of reversible algorithms can theoretically be overwhelmed by their energy-efficiency benefits, to improve overall cost-performance for large-scale computations.”

607.From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Jess Riedel: “For large computations, this conversion adds only a modest overhead in required time and memory. For example, the algorithm described in Charles Bennett’s 1989 paper ‘Time/Space Trade-Offs for Reversible Computation’ involves slow-downs of at worst a multiplicative factor, around 2-3× as slow” (p. 4). See also Open Philanthropy’s non-verbatim notes from a conversation with Dr. Paul Christiano: “The algorithmic overhead involved in reversible computing (specifically, the overhead involved in un-computing what you have already computed) is not that bad. Most of the difficulty lies in designing such efficient hardware” (p. 4). Bennett (1989): “Using a pebbling argument, this paper shows that, for any ε > 0, ordinary multitape Turing machines using time T and space S can be simulated by reversible ones using time O(T^(1+ε)) and space O(S log T), or in linear time and space O(ST^ε)… The time/space cost of computing a 1:1 function on such a machine is equal within a small polynomial to the cost of computing the function and its inverse on an ordinary Turing machine” (p. 766). See also Wolpert’s (2019a) overhead estimates, e.g.: “In general it requires doubling the total number of gates (i.e., the circuit’s size), doubling the running time of the circuit (i.e., the circuit’s depth), and increasing the number of edges coming out of each gate, by up to a factor of 3” (p. 28).
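
A toy rendering of the compute / copy-the-answer / uncompute pattern behind these overhead estimates (my illustration; the stored trace stands in for the extra space that Bennett’s method trades against time):

```python
def bennett_style(step, unstep, x0, n):
    """Run n invertible steps, copy the answer, then retrace the steps
    backwards so only the input and the copied answer remain."""
    trace = [x0]
    for _ in range(n):            # forward pass, accumulating "garbage"
        trace.append(step(trace[-1]))
    answer = trace[-1]            # copy out the result
    while len(trace) > 1:         # uncompute: undo each step in reverse
        top = trace.pop()
        assert unstep(top) == trace[-1]
    return trace[0], answer       # left with: original input + answer

# Example with a trivially invertible step:
print(bennett_style(lambda v: v + 3, lambda v: v - 3, 10, 5))  # (10, 25)
```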

608.From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Jess Riedel: “When humans write software to accomplish human objectives, they use a lot of irreversible steps (though there are some non-atomic reversible intermediate computations, like Fourier transforms)” (p. 4).

609.From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Jess Riedel: “When the world has some simple feature (e.g., the position and velocity of a rock heading towards your head), this feature is encoded in very complicated intermediate systems (e.g., the trillions of photons scattering from the rock and heading towards your eye). The brain has to distill an answer to a high-level question (e.g., “do I dodge left or right?”) from the complicated intermediate system, and this involves throwing out a lot of entropy” (p. 4).

610.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Jared Kaplan: “FLOPs in actual computers erase bits, and Prof. Kaplan expects that you generally have order one bit-erasures per operation in computational systems. That is, you don’t do a lot of complicated things with a bit, and then erase it, and then do another set of very complicated things with another bit, and then erase it, etc. Prof. Kaplan’s intuition in this respect comes from his understanding of certain basic operations you can do with small amounts of information. In principle you can perform a very complicated set of transformations on a piece of information, like an image, without erasing bits. Prof. Kaplan can imagine some kind of order one factor increase in required compute from this type of thing” (p. 4).

611.From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Jess Riedel: “if (as in current conventional computers) you’re dissipating thousands of kT per operation, it isn’t worth transitioning to logically reversible operations, because other forms of energy dissipation dominate the Landauer-mandated energy costs of logical irreversibility” (p. 4).

612.From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Paul Christiano: “Dr. Christiano does not think that logically irreversible operations are a more natural or default computational unit than reversible ones. And once we’re engaging with models of brain computation that invoke computations performed by low-level, reversible elements, then we are assuming that the brain is able to make use of such elements, in which case it may well have evolved a reliance on them from the start. For example, if it were possible to use proteins to directly perform large tunable matrix multiplications, Landauer’s principle implies that those matrix multiplications would necessarily be invertible or even unitary. But unitary matrix multiplications are just as useful for deep learning as general matrix multiplications, so Landauer’s principle per se doesn’t tell us anything about the feasibility of the scenario. Instead the focus should be on other arguments (e.g. regarding consistency and flexibility)” (p. 4).

613.My thanks to Prof. David Wallace for discussion.

614.Michael Frank gives a summary of the development of the literature on reversible computing here (see paragraphs starting with “I’ll summarize a few of the major historical developments…”).

615.See this 2014 interview with the Machine Intelligence Research Institute.

616.From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Michael Frank: “The biggest challenge is figuring out the fundamental physics involved in improving the trade-offs between energy dissipation and speed in reversible processes. We don’t know of any fundamental limits in this respect at the moment, but there may be some, and we need to understand them if so. One question is whether exploiting quantum phenomena can help. Dr. Frank is working on this at the moment. There are also practical issues involved in improving the degree of reversibility of mechanisms that we know how to design in principle, but which require a lot of advanced, high-precision engineering to get the level of efficiency we want. And there is a lot of engineering and design work to do at the level of circuits, architectures, design tools, and hardware description languages” (p. 2). See also page 1: “A lot of advanced physics and engineering is necessary for figuring out how to do reversible computing well. The goal is to create very fast, very energy-efficient systems. Currently, the closest examples are fairly rudimentary systems like simple oscillators. The transition to reversible computing won’t happen overnight, and it may take decades, even once fundamental problems are solved.”

617.From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Paul Christiano: “In irreversible computers, you do not need to keep track of and take into account what happens to each degree of freedom, because you are able to expend energy to reset the system to a state it needs to be in for your computation to proceed successfully. With reversible computers, however, you aren’t able to expend such energy, so what happens to any degree of freedom that could influence your computation starts to matter a lot; you can’t simply force the relevant physical variables into a particular state, so your computation needs to work for the particular state that those variables happen to be in. Given the reversibility of physics, this is a very difficult engineering challenge” (p. 5).

618.This is based primarily on eyeballing the chart presented at 4:17 in Michael Frank’s 2017 YouTube talk (Frank cites the International Roadmap of Semiconductors 2015, though I’m not sure where the specific information he’s pointing to comes from). According to Frank’s description of this chart, if you include various overhead factors that Frank suggests are extremely difficult to eliminate, we are currently dissipating around 10,000-50,000 kT per grounding of a circuit node at T=300K. The minimum energy used to switch the state of a minimum-sized transistor is smaller, between 100-1000 kT, but Frank suggests that using minimum-sized transistors is not always optimal for performance, and other overheads are in play as well. See also Frank (2018): “As the end of the semiconductor roadmap approaches, there is today a growing realization among industry leaders, researchers, funding agencies and investors that a transition to novel computing paradigms will be required in order for engineers to continue improving the energy efficiency (and thus, cost efficiency) of computing technology beyond the expected final CMOS node, when minimal transistor gate energies are expected to plateau at around the 40-80 kT level (∼ 1-2 eV at room temperature), with typical total CV^2 node energies plateauing at a much higher level of around 1-2 keV” (p. 2). Hänninen et al. (2011) also note that the Landauer limit is “nearly three orders of magnitude lower than end-of-the-roadmap CMOS transistors,” (p. 2370) which is roughly where Frank’s chart forecasts the asymptote for minimum-size transistors (if we include circuit-level overhead factors, it’s another couple orders of magnitude). Jess Riedel notes that humans can, if necessary, create very special-purpose computational devices that get much closer to Landauer’s limit (this, he suggests, is what the “experimental tests” of Landauer’s limit attempt to do), but that these aren’t useful for practical, large-scale computing (see Open Philanthropy’s non-verbatim notes from a conversation with Dr. Jess Riedel, p. 3). See also this conversation with Erik DeBenedictis, who predicts 2000 kT/logic op by 2030, including interconnect wire.
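
For orientation, the energy scales quoted in this note can be put side by side with a quick arithmetic sketch (the physical constants are standard; the 300K temperature and the eV/keV figures are taken from the quotes above):

```python
import math

k_B = 1.380649e-23             # Boltzmann constant, J/K
T = 300.0                      # room temperature, K
kT = k_B * T                   # ~4.14e-21 J

landauer = kT * math.log(2)    # minimum dissipation per bit erasure
eV = 1.602176634e-19           # one electron-volt, J

print(f"Landauer limit: {landauer:.2e} J (~0.69 kT)")
print(f"1 eV           = {eV / kT:.0f} kT")                       # ~39 kT
print(f"1-2 keV node   = {1e3 * eV / kT:.0f}-{2e3 * eV / kT:.0f} kT")

# The 40-80 kT gate energies quoted from Frank (2018) are ~1-2 eV, and the
# 1-2 keV CV^2 node energies come out to tens of thousands of kT -- the
# same ballpark as the 10,000-50,000 kT per-node figure from Frank's chart,
# i.e. 4-5 orders of magnitude above the Landauer limit.
```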

619.See calculation here.

620.See Aiello (1997) for some discussion. From Open Philanthropy’s non-verbatim notes from a conversation with Prof. David Wolpert: “Metabolic constraints are extremely important in evolutionary biology. But the field of evolutionary biology has not adequately incorporated discoveries about the energy costs of computation. The massive energy costs of the brain ground a presumption that it has been highly optimized for thermodynamic efficiencies. Understanding better how the brain’s architecture balances energy costs with computational performance may lead to important breakthroughs. However, at this point we are basically clueless about how the brain’s computation works, so we can’t even state this problem precisely” (p. 3).

621.See e.g. Kempes et al. (2017): “Here we show that the computational efficiency of translation, defined as free energy expended per amino acid operation, outperforms the best supercomputers by several orders of magnitude, and is only about an order of magnitude worse than the Landauer bound” (p. 1). Rahul Sarpeshkar, in a 2018 TED talk, suggests that cells are the most energy efficient computers that we know, and that they are already computing at an efficiency near the fundamental laws of physics (3:30-4:04). See also Laughlin et al. (1998): “Freed from heavy mechanical work, ion channels change conformation in roughly 100 μs. In principle, therefore, a single protein molecule, switching at the rate of an ion channel with the stoichiometry of kinesin, could code at least 10^3 bits per second at a cost of 1 ATP per bit” (p. 39). See Sarpeshkar (2013) for more on computation in cells, and Sarpeshkar (2010) for more on the energy-efficiency of biological systems more generally: “A single cell in the body performs ~10 million energy-consuming biochemical operations per second on its noisy molecular inputs with ~1 pW of average power. Every cell implements a ~30,000 node gene-protein molecular interaction network within its confines. All the ~100 trillion cells of the human body consume ~80 W of power at rest. The average energy for an elementary energy-consuming operation in a cell is about 20 kT, where kT is a unit of thermal energy. In deep submicron processes today, switching energies are nearly 10^4–10^5 kT for just an elementary 0->1 digital switching operation. Even at 10 nm, the likely end of business-as-usual transistor scaling in the future, it is unlikely that we will be able to match such energy efficiency. Unlike traditional digital computation, biological computation is tolerant to error in elementary devices and signals. Nature illustrates that it is significantly more energy efficient to compute with error-prone devices and signals and then correct for these errors through feedback-and-learning architectures than to make every device and every signal in a system robust, as in traditional digital paradigms thus far” (p. 18-19). Bennett (1989) also suggests that “a few thermodynamically efficient data processing systems do exist, notably genetic enzymes such as RNA polymerase, which, under appropriate reactant concentrations, can transcribe information from DNA to RNA at a thermodynamic cost considerably less than kT per step” (p. 766); see also Bennett (1973): “Tape copying is a logically reversible operation, and RNA polymerase is both thermodynamically and logically reversible” (p. 532). See also Ouldridge and ten Wolde (2017), Ouldridge (2017), Sartori et al. (2014), Mehta and Schwab (2012), and Mehta et al. (2016). Though see also Open Philanthropy’s non-verbatim notes from a conversation with Dr. Jess Riedel: “Biology may be very energy efficient in certain cases, but Dr. Riedel still thinks it very unlikely that the efficiency of the brain’s computation is anywhere near Landauer’s limit. There are also likely to be other examples in which biology is extremely inefficient relative to Landauer’s principle, due to other constraints (for example, cases in which biological systems use chemical gradients involving billions of molecules to communicate ~5 bits of information). Humans can, if necessary, create very special-purpose computational devices that get close to Landauer’s limit (this is what “experimental tests” of Landauer’s limit attempt to do), and our power plants, considered as thermodynamic heat engines, are very efficient (e.g., nearing thermodynamic bounds). However, our useful, scalable computers are not remotely close to the minimal energy dissipation required by Landauer’s principle. This appears to be an extraordinarily hard engineering problem, and it’s reasonable to guess that brains haven’t solved it, even if they are very energy efficient elsewhere” (p. 3).

622.From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Michael Frank: “In general, Dr. Frank does not see evidence that biology is attempting to do anything like what human engineers working on reversible computing are trying to do. Reversible computing is an extremely advanced tier of high-precision engineering, which we’re still struggling to figure out. Biology, by contrast, seems perfectly happy with what it can do with simple, irreversible mechanisms. … In general, most signaling mechanisms in biology are highly dissipative. For example, the biophysical processes involved in neural firing (e.g., vesicle release, action potential propagation, ion channels driving the ion concentrations to new states) dissipate lots of energy. Indeed, most of life seems to be based on strongly driven (e.g., irreversible) processes” (p. 4). From Open Philanthropy’s non-verbatim notes from a conversation with Prof. David Wolpert: “Prof. Wolpert also expects that using Landauer’s principle to estimate the amount of computation performed by the brain will result in substantial overestimates. A single neuron uses very complicated physical machinery to propagate a single bit along an axon. Prof. Wolpert expects this to be very far away from theoretical limits of efficiency. That said, some computational processes in biology are very energy efficient. For example, Prof. Wolpert recently co-authored a paper on protein synthesis in ribosomes, showing that the energy efficiency of the computation is only around two orders of magnitude worse than Landauer’s bound. Prof. Wolpert expects neurons to be much less efficient than this, but he doesn’t know” (p. 4).

623.See Laughlin et al. (1998): “Synapses and cells are using 10^5 to 10^8 times more energy than the thermodynamic minimum. Thermal noise sets a lower limit of k · T Joules for observing a bit of information (k, Boltzmann’s constant; T, absolute temperature, 290K) and the hydrolysis of one ATP molecule to ADP releases about 25 kT” (p. 39). Laughlin et al. (1998) also note that “At least two biophysical constraints will contribute to these systems’ costs. First, there is the uncertainty associated with molecular interactions. The stochastic nature of receptor activation (photon absorption), of molecular collision, of diffusion, and of vesicle release, degrades information by introducing noise (eqns. 1 and 7), thereby substantially increasing costs. Secondly, energy is required to distribute signals over relatively large distances. We suggest, therefore, that the high metabolic cost of information in systems is dictated by basic molecular and cellular constraints to cell signaling, as independently proposed by Sarpeshkar (see also Sarpeshkar (1997))” (p. 37).

624.Lennie (2003) writes that “The aggregate cost of a spike is 2.4 × 10^9 ATP molecules” (p. 493); combining this with Laughlin et al. (1998), who write that “the hydrolysis of one ATP molecule to ADP releases about 25 kT” (p. 39) (see also discussion here), gives 2.4e9 × 25 = 6e10. See also Bennett (1981): “Macroscopic size also explains the poor efficiency of neurons, which dissipate about 10^11 kT per discharge” (p. 907).
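
Spelled out, the multiplication behind the 6e10 figure (all inputs are the quoted Lennie (2003) and Laughlin et al. (1998) numbers):

```python
atp_per_spike = 2.4e9   # Lennie (2003): ATP molecules per spike
kT_per_atp = 25         # Laughlin et al. (1998): kT released per ATP hydrolysis
print(f"{atp_per_spike * kT_per_atp:.0e} kT per spike")  # prints 6e+10
# Same order of magnitude as Bennett (1981)'s ~10^11 kT per discharge.
```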

625.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. David Wolpert: “Prof. Wolpert also expects that using Landauer’s principle to estimate the amount of computation performed by the brain will result in substantial overestimates. A single neuron uses very complicated physical machinery to propagate a single bit along an axon. Prof. Wolpert expects this to be very far away from theoretical limits of efficiency” (p. 4).

626.From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Jess Riedel: “Presumably, we think we basically understand cases where the brain is sending very simple signals, like the signal to kick your leg. We know that the nerves involved in conveying these signals are operating in an irreversible way, and burning way more energy than the Landauer limit would say is necessary to communicate the number of bits needed to say e.g. how much to move the muscle. It seems this energy is required partly because the nerve is a big and complicated system, with many moving parts, so redundancy is necessary” (p. 3). From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Jared Kaplan: “For example, a lot of synapses, not too dissimilar from synapses in the brain, are used to send information to e.g. a muscle. Those synapses are using a lot of energy, and the brain is clearly going through a lot of effort to convey the relevant information confidently” (p. 3).

627.Laughlin et al. (1998) write that “the hydrolysis of one ATP molecule to ADP releases about 25 kT” (p. 39) (see also discussion here). Sarpeshkar (2014) also mentions “20 kT per molecular operation (1 ATP molecule hydrolysed)” (section 1). Swaminathan (2008) characterizes ATP as “the primary source of cellular energy” in rat brains; and studies of brain metabolism like Lennie (2003) use ATPs as the central basis for measuring the brain’s energy budget.

628.From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Paul Christiano: “Dr. Christiano would be extremely surprised if the brain got more computational mileage out of a single ATP than human engineers can get out of a FLOP, and he would be very willing to bet that it takes at least 10 ATPs to get the equivalent of a FLOP. Mr. Carlsmith estimates that the brain can be using no more than ~1e20 ATPs/second. If this estimate is right, then Dr. Christiano is very confident that you do not need more than 1e20 FLOP/s to replicate the brain’s task-performance” (p. 5).

629.Calculation here. This link also lists 1e-19 J per molecule, and 30-60 kJ per mole. Lennie (2003) estimates a “gross consumption of 3.4 × 10^21 molecules of ATP per minute” in the cortex, and that “in the normal awake state, cortex accounts for 44% of whole brain energy consumption,” suggesting ~6e19 ATPs/s in the cortex, and ~1e20 for the brain overall.
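
The steps behind the ~6e19 and ~1e20 figures, written out as a quick sketch using only the Lennie (2003) numbers quoted here:

```python
cortex_atp_per_min = 3.4e21                 # Lennie (2003): gross cortical consumption
cortex_atp_per_s = cortex_atp_per_min / 60  # ~5.7e19 ATP/s, i.e. ~6e19
brain_atp_per_s = cortex_atp_per_s / 0.44   # cortex is 44% of brain energy use
print(f"cortex: {cortex_atp_per_s:.1e} ATP/s")  # ~5.7e19
print(f"brain:  {brain_atp_per_s:.1e} ATP/s")   # ~1.3e20, i.e. ~1e20
```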

630.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Jared Kaplan: “In general, Prof. Kaplan thinks it unlikely that big, warm things are performing thermodynamically reversible computations” (p. 3). From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Jess Riedel: “… It seems this energy is required partly because the nerve is a big and complicated system, with many moving parts, so redundancy is necessary” (p. 3). See also Bennett (1981): “Macroscopic size also explains the poor efficiency of neurons, which dissipate about 10^11 kT per discharge” (p. 907).

631.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Jared Kaplan: “In general, Prof. Kaplan thinks it unlikely that big, warm things are performing thermodynamically reversible computations” (p. 3).

632.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Jared Kaplan: “If you’re in a regime where there is some signal to noise ratio, and you make your signal big to avoid noise, you can’t be doing something thermodynamically reversible: the noise is creating waste heat, and you’re extending your signal to get above that. Prof. Kaplan would have thought that basically all of the processes in the brain have this flavor” (p. 3). Laughlin et al. (1998) also note that “At least two biophysical constraints will contribute to these systems’ costs. First, there is the uncertainty associated with molecular interactions. The stochastic nature of receptor activation (photon absorption), of molecular collision, of diffusion, and of vesicle release, degrades information by introducing noise (eqns. 1 and 7), thereby substantially increasing costs. Secondly, energy is required to distribute signals over relatively large distances. We suggest, therefore, that the high metabolic cost of information in systems is dictated by basic molecular and cellular constraints to cell signaling, as independently proposed by Sarpeshkar (see also Sarpeshkar (1997))” (p. 37).

633.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Jared Kaplan: “Processes that involve diffusion also cannot be thermodynamically reversible. Diffusion increases entropy. For example, if you take two substances and mix them together, you have increased the entropy of that system” (p. 3). From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Michael Frank: “One example difference is that reversible computing engineers can use inertia to propagate signals at the speed of light, with very little energy dissipation. They can also achieve similarly efficient, high-speed results by sending magnetic flux quanta through superconducting circuits. The brain, however, relies on diffusion, which cannot take advantage of such inertia” (p. 4).

634.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Jared Kaplan (p. 3):

In general, it’s extremely difficult to build reversible computers. For example, all of the quantum computers we have are very rudimentary (quantum computers are a type of reversible computer), and it’s hard to keep them running for very long without destroying information. In order to be performing thermodynamically reversible computations, each neuron would have to have some sort of very specialized component, operating in a specialized environment crafted in order to perform the computation in a thermodynamically reversible way. It would be hard to keep this running for very long, and Prof. Kaplan doesn’t think this is happening.

From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Jess Riedel (p. 3):

Humans can, if necessary, create very special-purpose computational devices that get close to Landauer’s limit (this is what ‘experimental tests’ of Landauer’s limit attempt to do), and our power plants, considered as thermodynamic heat engines, are very efficient (e.g., nearing thermodynamic bounds). However, our useful, scalable computers are not remotely close to the minimal energy dissipation required by Landauer’s principle. This appears to be an extraordinarily hard engineering problem, and it’s reasonable to guess that brains haven’t solved it, even if they are very energy efficient elsewhere.

From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Michael Frank (p. 3-4):

In general, Dr. Frank does not see evidence that biology is attempting to do anything like what human engineers working on reversible computing are trying to do. Reversible computing is an extremely advanced tier of high-precision engineering, which we’re still struggling to figure out. Biology, by contrast, seems perfectly happy with what it can do with simple, irreversible mechanisms.

From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Paul Christiano (p. 5):

Dr. Christiano expects that experts in physics, chemistry, and computer engineering would generally think it extremely unlikely that the brain is erasing less than one bit per computationally useful FLOP it performs. If the brain were doing this, Dr. Christiano believes that this would mean that the brain is qualitatively much more impressive than any other biological machinery we are aware of.

635.The FLOP/s costs of the models in Beniaguev et al. (2020), Maheswaranathan et al. (2019), and Batty et al. (2017) are the most salient exception.

636.I don’t give much weight to the energy costs of current digital multiplier implementations, given that analog implementations may be much more efficient (see Sarpeshkar (1998) (p. 1605)).

637.A number of my confusions center on theoretical issues related to identifying the set of the computations that a physical system can be said to implement (see Piccinini (2017) for an introduction). For example, a simulation of a physical system at any level of detail is interpretable as a set of (possibly stochastic) transitions between logical states, and hence as a computation implemented by this system. In this sense, any physical system, dissipating a given amount of energy (a box of gas, a hurricane, etc.), implements an extremely complex computation that describes exactly what it in fact does or would do given different inputs. What’s more, there are broader questions about whether a given physical system can be understood as implementing any computation, given a sufficiently unnatural carving of logical states (see e.g. Aaronson (2011) (p. 23); Drescher (2006), Chapter 2; and Hemmo and Shenker (2019)). I feel very unclear about how both of these theoretical issues interact with constraints imposed by Landauer’s principle, and with estimates of the FLOP/s required to re-implement the computations in question. Indeed, note that if it were possible to move easily from bit-erasures to FLOP/s, then, naively applied, the Landauer argument discussed here seems to suggest that you can cap the FLOP/s required to simulate a physical system via the energy that system dissipates – a conclusion which fits poorly with the extreme computational costs of simulating low-level physical systems like interacting molecules or proteins in lots of detail. Tom Davidson also suggested that this understanding of Landauer’s principle implies that a system that gives the same output regardless of the input would have the highest Landauer energy costs, which seems somewhat strange to me (especially if we’re allowed to interpret any set of microstates as an output state). Prof. David Wolpert suggested a number of other possible complexities in our conversation (see Open Philanthropy’s non-verbatim notes from a conversation with Prof. David Wolpert (p. 3)) that I haven’t engaged with, and I expect that further investigation would uncover more.

638.In the context of human hardware, I’ll use the term to cover both on-chip memory bandwidth and bandwidth between chips, since brain-equivalent systems can use multiple chips; in some contexts, like a TPU, we might also include very short-distance communication taking place between ALUs. From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Paul Christiano: “Across many different models of computation (e.g. Turing Machines, RAM machines, circuits, etc.), computational resources tend to fall into a number of broad categories, including:

Memory (e.g., data the computer can store),

Communication (roughly, the amount of information the computer can send from one part to another),

Compute/number of operations.

The exact meaning of these concepts varies across models, but they are often useful to work with” (p. 1).

639.Howarth et al. (2012), Figure 1, estimate that maintaining resting potentials uses 15% of the total energy in the cortex (20% of signaling energy in the cortex), and action potentials use 16% (21% of signaling energy). Synaptic processes account for an additional 44% (see p. 1224). Schlaepfer et al. (2006), Table 1, suggests that white matter, which largely consists of myelinated axons, is about 30% of brain volume (p. 150). See Diamond (1996) for discussion of evolutionary pressures on metabolism and brain volume (p. 757).

640.See Dayan and Abbott (2001), Chapter 4 (p. 123-150); Zador (1998); Tsubo et al. (2012), Fuhrmann et al. (2001), Mainen and Sejnowski (1995), van Steveninck et al. (1997).

641.From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Paul Christiano: “One can also distinguish between the bandwidth available at different distances. Axons vary in length, shorter-distance communication in neurons occurs via dendrites, and at sufficiently short distances, the distinction between communication and computation becomes blurry. For example, a multiply is in some sense mostly communication, and one can think of different processes taking place within neurons as communication as well. For longer-distance communication, though, axons seem like the brain’s primary mechanism” (p. 2).

642.See discussion in Section 2.3. From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Paul Christiano: “There are other communication mechanisms in the brain (e.g., glia, neuromodulation, ephaptic effects), but Dr. Christiano expects that these will be lower-bandwidth than axon communication” (p. 2). This point is fairly similar to ones made in Section 2.3, but the idea here is that speed limits the information these mechanisms can send over different distances, rather than the amount of processing of information they can perform.

643.From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Paul Christiano: “the brain invests a sizeable portion of its energy and volume into communication via axons, which would be a strange investment if it had some other, superior communication mechanism available” (p. 2).

644.From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Paul Christiano: “You can roughly estimate the bandwidth of axon communication by dividing the firing rate by the temporal resolution of spiking. Thus, for example, if the temporal precision is 1 ms, and neurons are spiking at roughly 1 Hz, then each spike would communicate ~10 bits of information (e.g., log2(1000)). If you increase the temporal precision to every microsecond, that’s only a factor of two difference (e.g., log2(1,000,000) = ~20 bits)… Roughly 1e8 axons cross the corpus callosum, and these account for a significant fraction of the length of all axons (AI Impacts has some estimates in this regard). Based on estimates Dr. Christiano has seen for the total length of all axons and dendrites, and the estimate that 1 spike/second = 10 bits/second across each, he thinks the following bounds are likely: 1e9 bytes/s of long-distance communication (across the brain), 1e11 bytes/s of short-distance communication (where each neuron could access about 10 million nearby neurons), and larger amounts of very-short distance communication.” (p. 2-3). See also Zhou et al. (2013): “The largest commissural tract in the human brain is the corpus callosum (CC), with more than 200 million axons connecting the two cerebral hemispheres” (p. E2714).
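
Dr. Christiano’s per-spike estimate can be written as a short sketch (the log2 formula and the example numbers come from the quote above; the function name is mine):

```python
import math

def bits_per_spike(rate_hz, resolution_s):
    # Number of distinguishable spike-timing slots per inter-spike
    # interval: log2(1 / (rate * resolution)).
    return math.log2(1.0 / (rate_hz * resolution_s))

print(bits_per_spike(1.0, 1e-3))  # ~10 bits at 1 Hz firing, 1 ms precision
print(bits_per_spike(1.0, 1e-6))  # ~20 bits at 1 us precision: only 2x more

# For the ~1e8 axons crossing the corpus callosum, ~10 bits/s each gives
# ~1e9 bits/s (~1e8 bytes/s) across the callosum alone; the 1e9 bytes/s
# long-distance figure in the quote covers all long-distance axons.
```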

645.AI Impacts: “Traversed edges per second (TEPS) is a metric that was recently developed to measure communication costs, which were seen as neglected in high performance computing.8 The TEPS benchmark measures the time required to perform a breadth-first search on a large random graph, requiring propagating information across every edge of the graph (either by accessing memory locations associated with different nodes, or communicating between different processors associated with different nodes). You can read about the benchmark in more detail at the Graph 500 site.”

646.Their estimate makes a number of assumptions, including that (1) most relevant communication is between neurons (as opposed to e.g. internal to neurons); (2) that traversing an edge is relevantly similar to spiking; (3) that the distribution of edges traversed doesn’t make a material difference, and (4) that the graph characteristics are relevantly similar. I can imagine objections to (1) that focus on the possibility that important communication is taking place within dendrites (though tree structure arguments might limit the difference this makes); and objections, more generally, that focus on alternative conceptions of how many relevant “vertices” there are in the brain.

647.Here I describe a specific version of a general type of argument suggested by Dr. Paul Christiano. From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Paul Christiano: “Dr. Christiano puts some weight on the following type of a priori argument: if you have two computers that are comparable on one dimension (e.g., communication), but you can’t measure how they compare along any other dimensions, then a priori your median guess should be that they are comparable on these other dimensions as well (e.g., it would be strange to have a strong view about which is better)” (p. 2). The argument described above also incorporates the constraint that the dimension in question be important to task-performance, and appeals to the skill of the engineers in question.

648.The argument appears in a different light if all you know is that e.g. both computers are green (though even there, it would seem strange to think that e.g. the one on the left is probably better than the one on the right, if you have no information to distinguish them). My thanks to Paul Christiano for discussion.

649.From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Paul Christiano: “A V100 GPU has about 1e12 bytes/s of memory bandwidth on the chip (~10x the brain’s 1e11 bytes of short-distance communication, estimated above), and 3e11 bytes/s of off-chip bandwidth (~300x the brain’s 1e9 bytes/s of long-distance communication, estimated above). Dr. Christiano thinks that these memory access numbers are comparable, based on matching up the memory of a V100 (respectively, cluster of V100s) to the amount of information stored in synapses accessible by the “short-distance” (respectively, “long-distance”) connections described above” (p. 4).

650.From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Paul Christiano (p. 2-3).

651.From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Paul Christiano (p. 2-3).

652.From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Paul Christiano: “If we knew nothing else about the brain, then, this might suggest that the brain’s computational capacity will be less than, or at least comparable to, a V100’s computational capacity (~1e14 FLOP/s) as well. And even if our compute estimates for the brain are higher, communication estimates are plausibly more robust, and they provide a different indication of how powerful the brain is relative to our computers” (p. 4).

653.From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Kate Storrs: “Dr. Storrs’ sense is that, in the parts of the field she engages with most closely (e.g., systems level modeling, visual/cognitive/perceptual modeling, human behavior), and maybe more broadly, a large majority of people treat synaptic weights as the core learned parameters in the brain. That said, she is not a neurophysiologist, and so isn’t the right person to ask about what sort of biophysical complexities could imply larger numbers of parameters. She is peripherally aware of papers suggesting that glia help store knowledge, and there are additional ideas as well. The truth probably involves mechanisms other than synaptic weights, but she believes that the consensus is that such weights hold most of the knowledge” (p. 2). Though see Trettenbrein (2016) and Langille and Brown (2018) for some complications. And see here for a long list of quotes attesting to the role of synapses in memory.

654.See Section 2.1.1.

655.Bartol et al. (2015) suggest a minimum of “4.7 bits of information at each synapse” (they don’t estimate a maximum).

656.See Section 4.1.2.

657.Here I’m treating a synapse weight as ~1 byte.

658.From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Paul Christiano: “In designing brains, evolution had to make trade-offs in allocating resources (e.g., energy consumption, space) to additional communication mechanisms, vs. additional mechanisms used for computation. Human engineers designing chips also have to make trade-offs in budgeting resources (energy, chip real-estate) to computation vs. communication. Equipped with an estimate of the communication profile of the brain, then, we might be able to use our knowledge of how to balance communication and computation in human computers to estimate what it would take to match the compute power of the brain, or to match its overall performance” (p. 2).

659.See here: “The [eight] supercomputers measured here consistently achieve around 1-2 GTEPS per scaled TFLOPS (see Figure 3). The median ratio is 1.9 GTEPS/TFLOPS, the mean is 1.7 GTEPS/TFLOP, and the variance 0.14 GTEPS/TFLOP.” However, AI Impacts notes that they only looked at data about the relationship between TEPS and FLOP/s in a small number of computers, and they have not investigated whether it makes sense to extrapolate from this data to the brain.

660.See here: “Among a small number of computers we compared, FLOPS and TEPS seem to vary proportionally, at a rate of around 1.7 GTEPS/TFLOP. We also estimate that the human brain performs around 0.18 – 6.4 × 10^14 TEPS. Thus if the FLOPS:TEPS ratio in brains is similar to that in computers, a brain would perform around 0.9 – 33.7 × 10^16 FLOPS. We have not investigated how similar this ratio is likely to be.” 1e12/1.7e9 = ~600.
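
AI Impacts’ extrapolation, reproduced as arithmetic (all numbers from the quotes above; small differences from the quoted range come from rounding of the ratio):

```python
flops_per_teps = 1e12 / 1.7e9        # ~590 FLOPS per TEPS (1.7 GTEPS/TFLOP)
brain_teps = (0.18e14, 6.4e14)       # AI Impacts' brain TEPS estimate
low, high = (t * flops_per_teps for t in brain_teps)
print(f"{low:.1e} - {high:.1e} FLOP/s")  # ~1.1e16 - 3.8e17 FLOP/s
# Close to the quoted 0.9-33.7 x 10^16 FLOPS range.
```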

661.From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Paul Christiano: “Dr. Christiano’s approach requires some sort of production function relating the returns from investment in communication to investment in compute. Dr. Christiano’s starting point would be something like logarithmic returns (though there aren’t really two buckets, so a more accurate model would be much messier), and he thinks that when you have two complementary quantities (say, X and Y), a 50/50 resource split between them is reasonable across a wide range of production functions. After all, a 50% allocation to X will likely give you at least 50% of the maximal value that X can provide, and halving your allocation to X will only allow you to increase your allocation to Y by 50%” (p. 3).
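
A toy numerical check of this intuition (my own illustration, assuming a simple log(X) + log(Y) production function, which is a stand-in rather than a model from the notes):

```python
import math

budget = 100.0
for x_share in (0.1, 0.25, 0.5, 0.75, 0.9):
    x = x_share * budget
    y = budget - x
    v = math.log(x) + math.log(y)    # logarithmic returns to both buckets
    print(f"x share {x_share:.2f}: value {v:.3f}")

# Value peaks at the 50/50 split, and the curve is flat near the top: even
# a 25/75 allocation loses only a few percent of the maximum, matching the
# robustness claim quoted above.
```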

662.From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Paul Christiano: “Such a production function would also allow you to estimate what it would take to match the overall performance of the brain, even without matching its compute capacity. Thus, for example, it’s theoretically possible that biological systems have access to large amounts of very efficient computation. If we assume that the value of additional computation diminishes if communication is held fixed, though, then even if the brain has substantially more computation than human computers can mobilize, we might be able to match its overall performance regardless, by exceeding its communication capacity (and hence increasing the value of our marginal compute to overall performance)” (p. 3).

663.From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Paul Christiano: “One complication here is that the communication to computation ratio in human computers has changed over time. For example, traditional CPUs had less computation per unit communication than the current hardware used for AI applications, like GPUs (Dr. Christiano says that this is partly because it is easier to write software if you can operate on anything in memory rather than needing to worry about communication and parallelization). If we applied CPU-like ratios to the brain, we would get very low compute estimates. Current supercomputers, though, spend more comparable amounts of energy on communication (including within chips) and compute” (p. 3).

664.See Open Philanthropy’s non-verbatim notes from a conversation with Prof. Barak Pearlmutter: “Prof. Hans Moravec attempted to derive estimates of the computational capacity of the brain from examination of the retina. Prof. Pearlmutter thought that Moravec’s estimates for the computational costs of robotic vision were likely accurate, given Moravec’s expertise in vision” (p. 3).

665.From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Paul Christiano: “If you include a sufficiently broad range of tasks that the human brain can perform, and require similarly useful task-performance across the full range of inputs to which the brain could be exposed, it is likely that for at least one of the tasks in the relevant profile, for some set of inputs, the brain’s method will (a) be close to maximally algorithmically efficient (e.g., within an order of magnitude or two), and (b) use a substantial portion of the computational resources that the brain has available. For example, if you take a computer from the 60s, and you look at all of the tasks it could perform, Dr. Christiano expects that many of the algorithms it was running (for example: sorting), were close to optimally efficient. As another example, there is a very inefficient algorithm for SAT solving, which takes 2n time. For many inputs, we can improve on this algorithm by a huge amount, but we can’t for every input: indeed, there is a rough consensus amongst computer scientists that the very inefficient algorithm is close to the best one can do. Indeed, Dr. Christiano expects that for most algorithms, there will be some family of instances on which it does reasonably well. And given how large the space of possible tasks the brain performs is (we can imagine a very wide set of evaluation metrics and input regimes), the density of roughly-optimal-on-some-inputs algorithms doesn’t need to be that high for them to appear in the brain” (p. 7).

666.See here for V100 prices (currently ~$8,799); and here for Fugaku’s ~$1 billion price tag: “The six-year budget for the system and related technology development totaled about $1 billion, compared with the $600 million price tags for the biggest planned U.S. systems.” Fugaku FLOP/s performance is listed here, at around 4e17-5e17 FLOP/s. Google’s TPU supercomputer, which recently broke records in training ML systems, can also do ~4e17 FLOP/s, though I’m not sure of its cost. See Kumar (2020): “In total, this system delivers over 430 PFLOPs of peak performance.” The A100, for ~$200,000, can do 5e15 FLOP/s – see Mehar (2020). NVIDIA’s newest SuperPOD can deliver ~7×10^17 FLOP/s of AI performance – see Paikeday (2020).
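
For rough comparison, the implied price-performance of the systems listed in this note (a sketch using only the quoted list prices and peak FLOP/s figures; these ignore total cost of ownership and differences in precision and workload):

```python
systems = {
    "V100":        (1e14, 8.8e3),   # ~1e14 FLOP/s, ~$8,799
    "Fugaku":      (4.5e17, 1e9),   # ~4e17-5e17 FLOP/s, ~$1 billion budget
    "A100 system": (5e15, 2e5),     # ~5e15 FLOP/s, ~$200,000
}
for name, (flops, dollars) in systems.items():
    print(f"{name:>12}: {flops / dollars:.1e} FLOP/s per dollar")

# ~1e10 FLOP/s per dollar for the accelerators vs ~5e8 for the Fugaku
# budget figure -- commodity hardware delivers far more peak FLOP/s per
# dollar than the full supercomputer program cost.
```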

667.My colleague Ajeya Cotra’s investigation focuses on these issues.

668.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Eve Marder: “There are also some circuits in leeches, C. elegans, flies, and electric fish that are relatively well-characterized” (p. 4).

669.This is a criterion suggested by Dr. Paul Christiano. From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Paul Christiano: “In thinking about conceptual standards to use in generating estimates for the FLOP/s necessary to run a task-functional model of a computational system that exhibits some degree of similarity to that system, one constraint is that when you apply your standard to digital systems that actually perform FLOPs, it ought to yield an answer of one FLOP per FLOP (e.g., your estimate for a V100, which performs ~1e14 FLOP/s, should be 1e14 FLOP/s). That is, it shouldn’t yield an estimate of the FLOPs necessary to e.g. model every transistor, or to model lower-level physical processes in transistors leading to e.g. specific patterns of mistaken bit-flips” (p. 7-8).

670.From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Paul Christiano: “If you include a sufficiently broad range of tasks that the human brain can perform, and require similarly useful task-performance across the full range of inputs to which the brain could be exposed, it is likely that for at least one of the tasks in the relevant profile, for some set of inputs, the brain’s method will (a) be close to maximally algorithmically efficient (e.g., within an order of magnitude or two), and (b) use a substantial portion of the computational resources that the brain has available. For example, if you take a computer from the 60s, and you look at all of the tasks it could perform, Dr. Christiano expects that many of the algorithms it was running (for example: sorting), were close to optimally efficient. As another example, there is a very inefficient algorithm for SAT solving, which takes 2n time. For many inputs, we can improve on this algorithm by a huge amount, but we can’t for every input: indeed, there is a rough consensus amongst computer scientists that the very inefficient algorithm is close to the best one can do. Indeed, Dr. Christiano expects that for most algorithms, there will be some family of instances on which it does reasonably well. And given how large the space of possible tasks the brain performs is (we can imagine a very wide set of evaluation metrics and input regimes), the density of roughly-optimal-on-some-inputs algorithms doesn’t need to be that high for them to appear in the brain” (p. 7).

671.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Rosa Cao: “Prof. Cao does not believe that there is a privileged description of the computations that the brain is performing. We can imagine many different possible computational models of the brain, which will replicate different types of behavior, to within a given error-tolerance, in a given circumstance. In order to determine which biophysical processes are important, and what level of precision and detail you need in a model, you first need to specify the particular type of input-output relationship that you care about, and how the relevant outputs need to be produced. More generally, Prof. Cao thinks that the computational paradigm in neuroscience is conceptually underspecified. That is, the field is insufficiently clear about what it means to talk about the computations that the brain is performing” (p. 1).

672.From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Paul Christiano: “In the case of the brain, for example, a high-level description might be something like ‘it divides the work between these two hemispheres in the following way.’ Thus, to meet the relevant standard, ‘brain-like’ computational models will only need to replicate that hemispheric division. Beyond that, they can just employ the maximally efficient way of performing the task” (p. 8).

673.See Marr (1982) (p. 25).

674.From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Chris Eliasmith: “There is no privileged model of the brain which can claim to be the model of how the brain performs tasks. You can’t answer someone’s question about how the brain works without knowing exactly what the question is. Nor is there a privileged level of biological detail that a model needs to include in order count as a brain model, as all models are wrong to some extent. You can, though, specify a particular set of functions that a model needs to reproduce, with a particular degree of similarity to human behavior and anatomical and physiological data. Prof. Eliasmith’s work is basically oriented towards building a brain model that satisfies constraints of this type” (p. 4). From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Rosa Cao: “Prof. Cao does not believe that there is a privileged description of the computations that the brain is performing. We can imagine many different possible computational models of the brain, which will replicate different types of behavior, to within a given error-tolerance, in a given circumstance. In order to determine which biophysical processes are important, and what level of precision and detail you need in a model, you first need to specify the particular type of input-output relationship that you care about, and how the relevant outputs need to be produced. More generally, Prof. Cao thinks that the computational paradigm in neuroscience is conceptually underspecified. That is, the field is insufficiently clear about what it means to talk about the computations that the brain is performing” (p. 1).

675.See Bell (1999), Hanson (2011), and Lee (2011) for some discussion.

676.E.g., we can talk about how many FLOP/s it takes to run an EfficientNet-B2 at 10 Hz, given a description of the model.

677.See Piccinini (2017) for discussion of related issues.

678.For an example of the types of debates in this vein that do not seem to me particularly relevant or productive in this context, see here.

679.From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Paul Christiano: “Attempting to use some standard like “the description of the system you would give if you really understood how the system worked” might well result in over-estimates, since it would plausibly result in descriptions at lower levels, like transistors or NAND gates” (p. 8).

680.This definition is based on the definition of when one computational method represents another offered by Knuth (1997), p. 467, problem 9. See also Sandberg and Bostrom (2008): “A strict definition of simulation might be that a system S consists of a state x(t) evolving by a particular dynamics f, influenced by inputs and producing outputs: x(t+1) = f(I, x(t)), O(t) = g(x(t)). Another system T simulates S if it produces the same output (within a tolerance) for the same input time series starting with a given state (within a tolerance): X(t+1) = F(I, X(t)), O(t) = G(X(t)), where |x(t) − X(t)| < ε1 and X(0) = x(0) + ε2. The simulation is an emulation if F = f (up to a bijective transformation of X(t)), that is, the internal dynamics is identical and similar outputs are not due to the form of G(X(t)).”
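
Sandberg and Bostrom’s criterion can be sketched directly in code (the function and the example dynamics are mine; the state-tolerance comparison follows the quoted definition):

```python
def simulates(f_S, f_T, x0, X0, inputs, eps1, eps2):
    """Check the quoted tolerance criterion on one input time series."""
    if abs(X0 - x0) > eps2:          # initial states must match within eps2
        return False
    x, X = x0, X0
    for I in inputs:
        x, X = f_S(I, x), f_T(I, X)  # evolve both systems on the same inputs
        if abs(X - x) > eps1:        # trajectories must stay within eps1
            return False
    return True

# Example (hypothetical dynamics): a model tracks a leaky integrator.
f_S = lambda I, x: 0.9 * x + I
f_T = lambda I, X: 0.9 * X + I       # identical dynamics: an "emulation"
print(simulates(f_S, f_T, 0.0, 0.001, [1, 0, 1, 1], eps1=0.01, eps2=0.01))
```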

681.See e.g. Sandberg and Bostrom (2008), who note that the brain is not strictly simulable on their definition, due to chaotic dynamics, but that “there exists a significant amount of noise in the brain that does not prevent meaningful brain states from evolving despite the indeterminacy of their dynamics. A “softer” form of emulation may be possible to define that has a model or parameter error smaller than the noise level and is hence practically indistinguishable from a possible evolution of the original system” (p. 7).

682.E.g., whether a given method of transitioning between states in a way that doesn’t map to the brain is OK or not will depend on whether this is construed as part of the “algorithm” or part of its “implementation.” But implementation itself takes place at many levels of abstraction, which can themselves be described in algorithmic terms.

683.See this post by AI Impacts for a framework somewhat reminiscent of this conception, which plots indifference curves for different combinations of hardware and software sophistication. The post treats the brain as the point that combines “human-level hardware” and “evolution-level software engineering.” But we can also imagine defining human-level hardware as the amount of hardware that someone with “evolution-level software engineering skill” would need in order to create a computational system that matches human-level task-performance. My thanks to Paul Christiano, Katja Grace, and Ajeya Cotra for discussion of this approach.

684.See discussion in Schneider and Gersting (2018) (p. 96-100): “To measure time efficiency, we identify the fundamental unit (or units) of work of an algorithm and count how many times the work unit is executed” (p. 96). From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Jess Riedel: “In the context of a computational system, you can think of an ‘operation’ as a small computation that can be treated as atomic, at least with respect to a particular architecture” (p. 5).

685.See e.g. Thagard (2002), who chooses to count proteins instead of neurons.

686.If we construe the type of task-performance at stake in the “no constraints” option above as including any task the brain can perform in the sense at stake here, then the two collapse into each other. However, my sense is that when people talk about matching human-level task-performance, they generally have in mind the type of task-performance humans do in fact display, rather than the type of task-performance they could display in principle if “programmed” with arbitrary skill.

687.My thanks to Ajeya Cotra for discussion.

688.Strictly, they would need to correspond to the neurons and synapses in a particular human brain; but as I noted in Section 1.5, at the level of precision relevant to this report, I’m treating normal adult human brains as equivalent.

689.This is meant to exclude the possibility of using some other part of the model to do what is intuitively “all of the work,” but in some hyper-efficient manner.

690.In particular, despite the amount of evidence discussed in the report, I don’t think of these probabilities as particularly “robust.” Even in the final stages of this project, they’ve continued to vary somewhat as I’ve been exposed to new evidence, and as different considerations have become more or less salient to me (for example, whether 1e15 has fallen above or below my median has varied), and I expect that they will continue to do so, especially in response to more data about expert opinion. The numbers offered here are just a coarse-grained snapshot. I’ve also erred on the side of round numbers to avoid suggesting too much precision.

691.The estimate can be seen as keyed to a concept that combines “just pick a degree of brain-like-ness” with “reasonably brain-like.” It has the disadvantages of both – namely, arbitrariness and vagueness.

692.See Izhikevich (2004) (p. 1066); and the chart in Section 2.1.2.3.

693.See endnotes in Section 2.1.2.4 for examples.

694.See endnotes in Section 2.1.2.4.
